Age  Commit message  Author
2023-01-26  mptcp: let the in-kernel PM use mixed IPv4 and IPv6 addresses  Paolo Abeni
Currently the in-kernel PM arbitrarily enforces that a created subflow's family must match the main MPTCP socket, while the RFC allows mixing IPv4 and IPv6 subflows. This patch changes the in-kernel PM logic to create subflows matching the currently selected source (or destination) address. IPv4 sockets can pick only IPv4 addresses (and v4 mapped in v6), while IPv6 sockets not restricted to V6ONLY can pick either IPv4 or IPv6 addresses as long as the source and destination match. A previously introduced helper is used to ease family matching checks, taking care of IPv4 vs IPv4-mapped-IPv6 vs IPv6 only addresses. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/269 Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
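The family-matching rule described in this commit message can be illustrated with a small user-space sketch. This is not the kernel helper the message refers to; the function names `effective_family` and `can_pair` are hypothetical, and the logic only restates the rules given in the text (IPv4-mapped IPv6 counts as IPv4, source and destination must agree, V6ONLY sockets stay on IPv6).

```python
import ipaddress


def effective_family(addr: str) -> int:
    """Return 4 or 6, treating an IPv4-mapped IPv6 address as IPv4."""
    ip = ipaddress.ip_address(addr)
    if ip.version == 6 and ip.ipv4_mapped is not None:
        return 4
    return ip.version


def can_pair(sock_family: int, v6only: bool, src: str, dst: str) -> bool:
    """Hypothetical check: may a subflow src -> dst be created on this socket?"""
    src_fam = effective_family(src)
    dst_fam = effective_family(dst)
    # Source and destination must always belong to the same family.
    if src_fam != dst_fam:
        return False
    # An IPv4 socket can only use IPv4 (or v4-mapped) addresses.
    if sock_family == 4:
        return src_fam == 4
    # An IPv6 socket restricted to V6ONLY can only use IPv6 addresses.
    if v6only:
        return src_fam == 6
    # An unrestricted IPv6 socket can use either family.
    return True
```

For example, `can_pair(6, False, "192.0.2.1", "198.51.100.7")` is allowed under these rules, while the same pair on a V6ONLY socket is not.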
2023-01-26  icmp: Add counters for rate limits  Jamie Bainbridge
There are multiple ICMP rate limiting mechanisms:

* Global limits: net.ipv4.icmp_msgs_burst/icmp_msgs_per_sec
* v4 per-host limits: net.ipv4.icmp_ratelimit/ratemask
* v6 per-host limits: net.ipv6.icmp_ratelimit/ratemask

However, when ICMP output is limited, there is no way to tell which limit has been hit or even if the limits are responsible for the lack of ICMP output.

Add counters for each of the cases above. As we are within local_bh_disable(), use the __INC stats variant.

Example output:

  # nstat -sz "*RateLimit*"
  IcmpOutRateLimitGlobal          134                0.0
  IcmpOutRateLimitHost            770                0.0
  Icmp6OutRateLimitHost           84                 0.0

Signed-off-by: Jamie Bainbridge <jamie.bainbridge@gmail.com>
Suggested-by: Abhishek Rawal <rawal.abhishek92@gmail.com>
Link: https://lore.kernel.org/r/273b32241e6b7fdc5c609e6f5ebc68caf3994342.1674605770.git.jamie.bainbridge@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
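As a minimal sketch of how the new counters might be consumed, the snippet below parses text in the shape of the example nstat output above and reports which limits have fired. The helper names `parse_nstat` and `limits_hit` are hypothetical, and the sample text is simply the example from the commit message, not fresh measurements.

```python
# Sample text copied from the example output in the commit message.
SAMPLE = """\
IcmpOutRateLimitGlobal          134                0.0
IcmpOutRateLimitHost            770                0.0
Icmp6OutRateLimitHost           84                 0.0
"""


def parse_nstat(text):
    """Map counter name -> absolute value from nstat-style output."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and header/comment lines starting with '#'.
        if len(fields) < 2 or fields[0].startswith("#"):
            continue
        counters[fields[0]] = int(fields[1])
    return counters


def limits_hit(counters):
    """Return the sorted names of rate-limit counters that are non-zero."""
    return sorted(
        name for name, value in counters.items()
        if "RateLimit" in name and value > 0
    )


if __name__ == "__main__":
    print(limits_hit(parse_nstat(SAMPLE)))
```

A non-zero global counter points at icmp_msgs_burst/icmp_msgs_per_sec, while the two host counters point at the per-host v4 and v6 limits respectively.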
2023-01-26  Merge branch 'adding-sparx5-is0-vcap-support'  Paolo Abeni
Steen Hegelund says:

====================
Adding Sparx5 IS0 VCAP support

This provides the Ingress Stage 0 (IS0) VCAP (Versatile Content-Aware Processor) support for the Sparx5 platform.

The IS0 VCAP (also known in the datasheet as CLM) is a classifier VCAP that mainly extracts frame information to metadata that follows the frame in the Sparx5 processing flow all the way to the egress port.

The IS0 VCAP has 6 lookups and they are accessible with a TC chain id:

- chain 1000000: IS0 Lookup 0
- chain 1100000: IS0 Lookup 1
- chain 1200000: IS0 Lookup 2
- chain 1300000: IS0 Lookup 3
- chain 1400000: IS0 Lookup 4
- chain 1500000: IS0 Lookup 5

Each of these lookups has its own port keyset configuration that decides which keys will be used for matching on which traffic type.

The IS0 VCAP has these traffic classifications:

- IPv4 frames
- IPv6 frames
- Unicast MPLS frames (ethertype = 0x8847)
- Multicast MPLS frames (ethertype = 0x8848)
- Other frame types than MPLS, IPv4 and IPv6

The IS0 VCAP has an action that allows setting the value of a PAG (Policy Association Group) key field in the frame metadata, and this can be used for matching in an IS2 VCAP rule.

This allows rules in the IS0 VCAP to be linked to rules in the IS2 VCAP. The linking is exposed by using the TC "goto chain" action with an offset from the IS2 chain ids.

As an example, a "goto chain 8000001" will use a PAG value of 1 to chain to a rule in IS2 Lookup 0.
====================

Link: https://lore.kernel.org/r/20230124104511.293938-1-steen.hegelund@microchip.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
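The chain id arithmetic in this cover letter can be sketched as follows. The constants are read off the text: the IS0 base and the spacing of 100000 come from the listed chain ids, and the IS2 Lookup 0 base of 8000000 is an assumption inferred from the "goto chain 8000001" example. The function names are hypothetical and are not part of the driver.

```python
IS0_BASE = 1_000_000     # chain id of IS0 Lookup 0 (from the list above)
IS2_BASE = 8_000_000     # assumed: inferred from the "goto chain 8000001" example
LOOKUP_STRIDE = 100_000  # spacing between consecutive lookups in the list
IS0_LOOKUPS = 6          # number of chain ids listed for IS0


def is0_lookup(chain_id):
    """Return the IS0 lookup index addressed by a TC chain id."""
    offset = chain_id - IS0_BASE
    index, remainder = divmod(offset, LOOKUP_STRIDE)
    if offset < 0 or remainder != 0 or index >= IS0_LOOKUPS:
        raise ValueError(f"chain {chain_id} is not an IS0 lookup chain")
    return index


def is2_goto_target(chain_id):
    """Split a 'goto chain' id into (IS2 lookup index, PAG value)."""
    offset = chain_id - IS2_BASE
    if offset < 0:
        raise ValueError(f"chain {chain_id} is below the assumed IS2 base")
    # The part below the stride is the PAG offset from the lookup's chain id.
    return divmod(offset, LOOKUP_STRIDE)


if __name__ == "__main__":
    print(is0_lookup(1_300_000))
    print(is2_goto_target(8_000_001))
```

With these assumptions, "goto chain 8000001" decomposes into IS2 Lookup 0 and PAG value 1, matching the example in the text.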
2023-01-26  net: microchip: sparx5: Add support for IS0 VCAP CVLAN TC keys  Steen Hegelund
This adds support for parsing and matching on the CVLAN tags in the Sparx5 IS0 VCAP. Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-26  net: microchip: sparx5: Add support for IS0 VCAP ethernet protocol types  Steen Hegelund
This allows the IS0 VCAP to have its own list of supported ethernet protocol types matching what is supported by the VCAP's port lookup classification. Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-26  net: microchip: sparx5: Add automatic selection of VCAP rule actionset  Steen Hegelund
With more than one possible actionset in a VCAP instance, the VCAP API will now use the actions in a VCAP rule to select the actionset that fits these actions the best possible way. Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-26  net: microchip: sparx5: Add TC filter chaining support for IS0 and IS2 VCAPs  Steen Hegelund
This allows rules to be chained between VCAP instances, e.g. from IS0 Lookup 0 to IS0 Lookup 1, or from one of the IS0 Lookups to one of the IS2 Lookups. Chaining from an IS2 Lookup to another IS2 Lookup is not supported in the hardware. Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-26  net: microchip: sparx5: Add TC support for IS0 VCAP  Steen Hegelund
This enables the TC command to use the Sparx5 IS0 VCAP. Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-26  net: microchip: sparx5: Add actionset type id information to rule  Steen Hegelund
This adds the actionset type id to the rule information. This is needed as we now have more than one actionset in a VCAP instance (IS0). Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-26  net: microchip: sparx5: Add IS0 VCAP keyset configuration for Sparx5  Steen Hegelund
This adds the IS0 VCAP port keyset configuration for Sparx5 and also updates the debugfs support to show the keyset configuration. Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-26  net: microchip: sparx5: Add IS0 VCAP model and updated KUNIT VCAP model  Steen Hegelund
This provides the IS0 (Ingress Stage 0) or CLM VCAP model for Sparx5. This VCAP provides classification actions for Sparx5. Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-25  Merge branch 'add-ip_local_port_range-socket-option'  Jakub Kicinski
Jakub Sitnicki says:

====================
Add IP_LOCAL_PORT_RANGE socket option

This patch set is a follow up to the "How to share IPv4 addresses by partitioning the port space" talk given at LPC 2022 [1].

Please see patch #1 for the motivation & the use case description. Patch #2 adds tests exercising the new option in various scenarios.

Documentation
-------------

Proposed update to the ip(7) man-page:

  IP_LOCAL_PORT_RANGE (since Linux X.Y)
    Set or get the per-socket default local port range. This option can be used to clamp down the global local port range, defined by the ip_local_port_range /proc interface described below, for a given socket.

    The option takes a uint32_t value with the high 16 bits set to the upper range bound, and the low 16 bits set to the lower range bound. Range bounds are inclusive. The 16-bit values should be in host byte order.

    The lower bound has to be less than the upper bound when both bounds are not zero. Otherwise, setting the option fails with EINVAL.

    If either bound is outside of the global local port range, or is zero, then that bound has no effect.

    To reset the setting, pass zero as both the upper and the lower bound.

Interaction with SELinux bind() hook
------------------------------------

SELinux bind() hook - selinux_socket_bind() - performs a permission check if the requested local port number lies outside of the netns ephemeral port range.

The proposed socket option cannot be used to change the ephemeral port range to extend beyond the per-netns port range, as set by net.ipv4.ip_local_port_range.

Hence, there is no interaction with SELinux, AFAICT.

RFC -> v1
RFC: https://lore.kernel.org/netdev/20220912225308.93659-1-jakub@cloudflare.com/

 * Allow either the high bound or the low bound, or both, to be zero
 * Add getsockopt support
 * Add selftests

Links:
------

[1]: https://lpc.events/event/16/contributions/1349/
====================

Link: https://lore.kernel.org/r/20221221-sockopt-port-range-v6-0-be255cc0e51f@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
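The clamping rule from the proposed man-page text ("if either bound is outside of the global local port range, or is zero, then that bound has no effect") can be sketched in a few lines. This is an illustration of the documented semantics only, not the kernel implementation; `effective_port_range` is a hypothetical name.

```python
def effective_port_range(netns_lo, netns_hi, sock_lo, sock_hi):
    """Return the (lo, hi) range a socket would search, per the text above.

    netns_lo/netns_hi model the per-netns ip_local_port_range setting;
    sock_lo/sock_hi model the per-socket IP_LOCAL_PORT_RANGE bounds
    (0 meaning "not set").
    """
    lo, hi = netns_lo, netns_hi
    # A per-socket bound only takes effect when it is non-zero and lies
    # inside the per-netns range; otherwise the netns bound is kept.
    if sock_lo != 0 and netns_lo <= sock_lo <= netns_hi:
        lo = sock_lo
    if sock_hi != 0 and netns_lo <= sock_hi <= netns_hi:
        hi = sock_hi
    return lo, hi


if __name__ == "__main__":
    # Per-socket range fully inside the netns range: it is used as-is.
    print(effective_port_range(32768, 60999, 40000, 40511))
    # Lower bound outside the netns range: only the upper bound applies.
    print(effective_port_range(32768, 60999, 1024, 40511))
```

The sketch makes the "narrow only" property visible: no combination of per-socket bounds can produce a range that extends beyond the per-netns one.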
2023-01-25  selftests/net: Cover the IP_LOCAL_PORT_RANGE socket option  Jakub Sitnicki
Exercise IP_LOCAL_PORT_RANGE socket option in various scenarios:

1. pass invalid values to setsockopt
2. pass a range outside of the per-netns port range
3. configure a single-port range
4. exhaust a configured multi-port range
5. check interaction with late-bind (IP_BIND_ADDRESS_NO_PORT)
6. set then get the per-socket port range

Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-25  inet: Add IP_LOCAL_PORT_RANGE socket option  Jakub Sitnicki
Users who want to share a single public IP address for outgoing connections between several hosts traditionally reach for SNAT. However, SNAT requires state keeping on the node(s) performing the NAT.

A stateless alternative exists, where a single IP address used for egress can be shared between several hosts by partitioning the available ephemeral port range. In such a setup:

1. Each host gets assigned a disjoint range of ephemeral ports.
2. Applications open connections from the host-assigned port range.
3. Return traffic gets routed to the host based on both the destination IP and the destination port.

An application which wants to open an outgoing connection (connect) from a given port range today can choose between two solutions:

1. Manually pick the source port by bind()'ing to it before connect()'ing the socket.

   This approach has a couple of downsides:

   a) Search for a free port has to be implemented in the user-space. If the chosen 4-tuple happens to be busy, the application needs to retry from a different local port number.

      Detecting if the 4-tuple is busy can be either easy (TCP) or hard (UDP). In the TCP case, the application simply has to check if connect() returned an error (EADDRNOTAVAIL). That is assuming that the local port sharing was enabled (REUSEADDR) by all the sockets.

        # Assume desired local port range is 60_000-60_511
        s = socket(AF_INET, SOCK_STREAM)
        s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
        s.bind(("192.0.2.1", 60_000))
        s.connect(("1.1.1.1", 53))
        # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy
        # Application must retry with another local port

      In the case of UDP, the network stack allows binding more than one socket to the same 4-tuple, when local port sharing is enabled (REUSEADDR). Hence detecting the conflict is much harder and involves querying sock_diag and toggling the REUSEADDR flag [1].

   b) For TCP, bind()-ing to a port within the ephemeral port range means that no connecting sockets, that is those which leave it to the network stack to find a free local port at connect() time, can use this port.

      IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port will be skipped during the free port search at connect() time.

2. Isolate the app in a dedicated netns and use the per-netns ip_local_port_range sysctl to adjust the ephemeral port range bounds.

   The per-netns setting affects all sockets, so this approach can be used only if:

   - there is just one egress IP address, or
   - the desired egress port range is the same for all egress IP addresses used by the application.

   For TCP, this approach avoids the downsides of (1). Free port search and 4-tuple conflict detection is done by the network stack:

     system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'")

     s = socket(AF_INET, SOCK_STREAM)
     s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1)
     s.bind(("192.0.2.1", 0))
     s.connect(("1.1.1.1", 53))
     # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy

   For UDP this approach has limited applicability. Setting the IP_BIND_ADDRESS_NO_PORT socket option does not result in the local source port being shared with other connected UDP sockets.

   Hence relying on the network stack to find a free source port limits the number of outgoing UDP flows from a single IP address down to the number of available ephemeral ports.

To put it another way, partitioning the ephemeral port range between hosts using the existing Linux networking API is cumbersome.

To address this use case, add a new socket option at the SOL_IP level, named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the ephemeral port range for each socket individually.

The option can be used only to narrow down the per-netns local port range. If the per-socket range lies outside of the per-netns range, the latter takes precedence.

UAPI-wise, the low and high range bounds are passed to the kernel as a pair of u16 values in host byte order packed into a u32. This avoids pointer passing.

  PORT_LO = 40_000
  PORT_HI = 40_511

  s = socket(AF_INET, SOCK_STREAM)
  v = struct.pack("I", PORT_HI << 16 | PORT_LO)
  s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v)
  s.bind(("127.0.0.1", 0))
  s.getsockname()
  # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511),
  # if there is a free port. EADDRINUSE otherwise.

[1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116

Reviewed-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
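The u32 encoding described above (upper bound in the high 16 bits, lower bound in the low 16 bits, host byte order) can be wrapped in a pair of small helpers. This is a sketch, not part of the patch: `pack_port_range` and `unpack_port_range` are hypothetical names, and rejecting only `lo > hi` is an assumption based on reading the documented bound check together with the single-port-range scenario in the selftests.

```python
import struct


def pack_port_range(lo, hi):
    """Encode (lo, hi) as the native-endian u32 the option expects."""
    for bound in (lo, hi):
        if not 0 <= bound <= 0xFFFF:
            raise ValueError(f"port bound {bound} does not fit in 16 bits")
    # Assumption: only an inverted range is invalid when both bounds are
    # set; a zero bound means "leave that side unchanged".
    if lo != 0 and hi != 0 and lo > hi:
        raise ValueError("lower bound exceeds upper bound")
    return struct.pack("I", hi << 16 | lo)


def unpack_port_range(raw):
    """Decode a 4-byte native-endian value back into (lo, hi)."""
    (value,) = struct.unpack("I", raw)
    return value & 0xFFFF, value >> 16


if __name__ == "__main__":
    raw = pack_port_range(40_000, 40_511)
    print(unpack_port_range(raw))
```

The packed bytes are what would be handed to setsockopt() with IP_LOCAL_PORT_RANGE, and unpack_port_range() is the matching decode for a getsockopt() result; passing `pack_port_range(0, 0)` corresponds to resetting the option.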
2023-01-25  net: Kconfig: fix spellos  Randy Dunlap
Fix spelling in net/ Kconfig files. (reported by codespell) Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Jozsef Kadlecsik <kadlec@netfilter.org> Cc: Florian Westphal <fw@strlen.de> Cc: coreteam@netfilter.org Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Link: https://lore.kernel.org/r/20230124181724.18166-1-rdunlap@infradead.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-25  net: ethtool: fix NULL pointer dereference in pause_prepare_data()  Vladimir Oltean
In the following call path:

  ethnl_default_dumpit
  -> ethnl_default_dump_one
     -> ctx->ops->prepare_data
        -> pause_prepare_data

struct genl_info *info will be passed as NULL, and pause_prepare_data() dereferences it while getting the extended ack pointer.

To avoid that, just set the extack to NULL if "info" is NULL, since the netlink extack handling messages know how to deal with that.

The pattern "info ? info->extack : NULL" is present in quite a few other "prepare_data" implementations, so it's clear that it's a more general problem to be dealt with at a higher level, but the code should have at least adhered to the current conventions to avoid the NULL dereference.

Fixes: 04692c9020b7 ("net: ethtool: netlink: retrieve stats from multiple sources (eMAC, pMAC)")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reported-by: syzbot+9d44aae2720fc40b8474@syzkaller.appspotmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-25  net: ethtool: fix NULL pointer dereference in stats_prepare_data()  Vladimir Oltean
In the following call path:

  ethnl_default_dumpit
  -> ethnl_default_dump_one
     -> ctx->ops->prepare_data
        -> stats_prepare_data

struct genl_info *info will be passed as NULL, and stats_prepare_data() dereferences it while getting the extended ack pointer.

To avoid that, just set the extack to NULL if "info" is NULL, since the netlink extack handling messages know how to deal with that.

The pattern "info ? info->extack : NULL" is present in quite a few other "prepare_data" implementations, so it's clear that it's a more general problem to be dealt with at a higher level, but the code should have at least adhered to the current conventions to avoid the NULL dereference.

Fixes: 04692c9020b7 ("net: ethtool: netlink: retrieve stats from multiple sources (eMAC, pMAC)")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-25  Merge branch 's390-ism-generalized-interface'  David S. Miller
Jan Karcher says:

====================
drivers/s390/net/ism: Add generalized interface

Previously, there was no clean separation between SMC-D code and the ISM device driver. This patch series addresses the situation to make ISM available for uses outside of SMC-D.

In detail: SMC-D offers an interface via struct smcd_ops, which only the ISM module implements so far. However, there is no real separation between the smcd and ism modules, which starts right with the ISM device initialization, which calls directly into the SMC-D code.

This patch series introduces a new API in the ISM module, which allows registration of arbitrary clients via include/linux/ism.h: struct ism_client. Furthermore, it introduces a "pure" struct ism_dev (i.e. getting rid of dependencies on SMC-D in the device structure), and adds a number of API calls for data transfers via ISM (see ism_register_dmb() & friends).

Still, the ISM module implements the SMC-D API, and therefore has a number of internal helper functions for that matter.

Note that the ISM API is consciously kept thin for now (as compared to the SMC-D API calls), as a number of API calls are only used with SMC-D and hardly have any meaningful usage beyond SMC-D, e.g. the VLAN-related calls.

v1 -> v2: Removed s390x dependency which broke config for other archs.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-25  net/smc: De-tangle ism and smc device initialization  Stefan Raspl
The struct device for ISM devices was part of struct smcd_dev. Move to struct ism_dev, provide a new API call in struct smcd_ops, and convert existing SMCD code accordingly. Furthermore, remove struct smcd_dev from struct ism_dev. This is the final part of a bigger overhaul of the interfaces between SMC and ISM. Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-25  s390/ism: Consolidate SMC-D-related code  Stefan Raspl
The ism module had SMC-D-specific code sprinkled across the entire module. We are now consolidating the SMC-D-specific parts into the latter parts of the module, so it becomes more clear what code is intended for use with ISM, and which parts are glue code for usage in the context of SMC-D. This is the fourth part of a bigger overhaul of the interfaces between SMC and ISM. Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-25  net/smc: Separate SMC-D and ISM APIs  Stefan Raspl
We separate the code implementing the struct smcd_ops API in the ISM device driver from the functions that may be used by other exploiters of ISM devices. Note: We start out small, and don't offer the whole breadth of the ISM device for public use, as many functions are specific to or likely only ever used in the context of SMC-D. This is the third part of a bigger overhaul of the interfaces between SMC and ISM. Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-25  net/smc: Register SMC-D as ISM client  Stefan Raspl
Register the smc module with the new ism device driver API. This is the second part of a bigger overhaul of the interfaces between SMC and ISM. Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-25  net/ism: Add new API for client registration  Stefan Raspl
Add a new API that allows other drivers to concurrently access ISM devices. To do so, we introduce a new API that allows other modules to register for ISM device usage. Furthermore, we move the GID to struct ism, where it belongs conceptually, and rename and relocate struct smcd_event to struct ism_event. This is the first part of a bigger overhaul of the interfaces between SMC and ISM. Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-25  s390/ism: Introduce struct ism_dmb  Stefan Raspl
Conceptually, a DMB is a structure that belongs to ISM devices. However, SMC currently 'owns' this structure. So future exploiters of ISM devices would be forced to include SMC headers to work - which is just weird. Therefore, we switch ISM to struct ism_dmb, introduce a new public header with the definition (will be populated with further API calls later on), and add a thin wrapper to please SMC. Since structs smcd_dmb and ism_dmb are identical, we can simply convert between the two for now. Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-25  net/ism: Add missing calls to disable bus-mastering  Stefan Raspl
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-25  net/smc: Terminate connections prior to device removal  Stefan Raspl
Removing an ISM device prior to terminating its associated connections doesn't end well. Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-25  virtio-net: Reduce debug name field size to 16 bytes  Parav Pandit
The virtio queue index can be at most 65535, so 16 bytes are enough to store the vq name with the existing string prefix.

With this change, the send queue struct saves 24 bytes and the receive queue struct saves a whole cache line (64 bytes) per structure, thanks to the reduction in alignment bytes.

Pahole results before:

  pahole -s drivers/net/virtio_net.o | \
      grep -e "send_queue" -e "receive_queue"
  send_queue      1112    0
  receive_queue   1280    1

Pahole results after:

  pahole -s drivers/net/virtio_net.o | \
      grep -e "send_queue" -e "receive_queue"
  send_queue      1088    0
  receive_queue   1216    1

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
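The sizing argument can be checked with a few lines of arithmetic. The prefixes "input." and "output." are an assumption about the "existing string prefix" mentioned in the message (the message itself does not spell them out); the struct sizes are the pahole figures quoted above.

```python
MAX_VQ_INDEX = 65535              # largest queue index, from the message
NAME_FIELD_SIZE = 16              # new size of the debug name field
PREFIXES = ("input.", "output.")  # assumed name prefixes, not stated in the message


def longest_name_bytes():
    """Bytes needed for the longest possible vq name, including the NUL."""
    longest = max(len(f"{prefix}{MAX_VQ_INDEX}") for prefix in PREFIXES)
    return longest + 1


# Struct sizes reported by pahole before and after the change.
SEND_QUEUE = (1112, 1088)
RECEIVE_QUEUE = (1280, 1216)

if __name__ == "__main__":
    print("longest name:", longest_name_bytes(), "bytes")
    print("send_queue saving:", SEND_QUEUE[0] - SEND_QUEUE[1])
    print("receive_queue saving:", RECEIVE_QUEUE[0] - RECEIVE_QUEUE[1])
```

Under the assumed prefixes the longest name fits comfortably in 16 bytes, and the pahole differences reproduce the 24-byte and 64-byte savings stated in the message.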
2023-01-24  devlink: remove a dubious assumption in fmsg dumping  Jakub Kicinski
Build bot detects that err may be returned uninitialized in devlink_fmsg_prepare_skb(). This is not really true because all fmsg users should create at least one outer nest, and therefore fmsg can't be completely empty. That said, the assumption is not trivial to confirm, so let's follow the bot's advice anyway. This code does not seem to have changed since its inception in commit 1db64e8733f6 ("devlink: Add devlink formatted message (fmsg) API"). Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230124035231.787381-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-24  net: mscc: ocelot: fix incorrect verify_enabled reporting in ethtool get_mm()  Vladimir Oltean
We don't read the verify_enabled variable from hardware in the MAC Merge layer state GET operation, instead we always leave it set to "false". The user may think something is wrong if they set verify_enabled to true, then read it back and see it's still false, even though the configuration took place. Fixes: 6505b6805655 ("net: mscc: ocelot: add MAC Merge layer support for VSC9959") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20230123184538.3420098-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-24  nfp: flower: change get/set_eeprom logic and enable for flower reps  James Hershaw
The changes in this patch are as follows:

- Alter the logic of get/set_eeprom functions to use the helper function nfp_app_from_netdev() which handles differentiating between an nfp_net and a nfp_repr. This allows us to get an agnostic backpointer to the pdev.

- Enable the various eeprom commands by adding the 'get_eeprom_len', 'get_eeprom', 'set_eeprom' callbacks to the nfp_port_ethtool_ops struct. This allows the eeprom commands to work on representor interfaces, similar to a previous patch which added it to the vnics.

Currently these are being used to configure persistent MAC addresses for the physical ports on the nfp.

Signed-off-by: James Hershaw <james.hershaw@corigine.com>
Reviewed-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230123134135.293278-1-simon.horman@corigine.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-24  ipv6: Make ip6_route_output_flags_noref() static.  Guillaume Nault
This function is only used in net/ipv6/route.c and has no reason to be visible outside of it. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/50706db7f675e40b3594d62011d9363dce32b92e.1674495822.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-24  netlink: fix spelling mistake in dump size assert  Jakub Kicinski
Commit 2c7bc10d0f7b ("netlink: add macro for checking dump ctx size") misspelled the name of the assert as asset, missing an R. Reported-by: Ido Schimmel <idosch@idosch.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/20230123222224.732338-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-24  Merge branch 'netlink-protocol-specs'  Paolo Abeni
Jakub Kicinski says:

====================
Netlink protocol specs

I think the Netlink proto specs are far along enough to merge. Filling in all attribute types and quirks will be an ongoing effort but we have enough to cover FOU so it's somewhat complete.

I fully intend to continue polishing the code but at the same time I'd like to start helping others base their work on the specs (e.g. DPLL) and need to start working on some new families myself.

That's the progress / motivation for merging. The RFC [1] has more of a high level blurb, plus I created a lot of documentation, I'm not going to repeat it here. There was also the talk at LPC [2].

[1] https://lore.kernel.org/all/20220811022304.583300-1-kuba@kernel.org/
[2] https://youtu.be/9QkXIQXkaQk?t=2562

v2: https://lore.kernel.org/all/20220930023418.1346263-1-kuba@kernel.org/
v3: https://lore.kernel.org/all/20230119003613.111778-1-kuba@kernel.org/1
v4:
 - spec improvements (patch 2)
 - Python cleanup (patch 3)
 - rename auto-gen files and use the right comment style
====================

Link: https://lore.kernel.org/r/20230120175041.342573-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-24  tools: ynl: add a completely generic client  Jakub Kicinski
Add a CLI sample which can take in an arbitrary request in JSON format, convert it to Netlink and do the inverse for output. It's meant primarily as a development tool, and perhaps for selftests which need to tickle netlink in a special way. Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-24  net: fou: use policy and operation tables generated from the spec  Jakub Kicinski
Generate and plug in the spec-based tables. A little bit of renaming is needed in the FOU code. Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-24  net: fou: rename the source for linking  Jakub Kicinski
We'll need to link two objects together to form the fou module. This means the source can't be called fou, the build system expects fou.o to be the combined object. Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-24  net: fou: regenerate the uAPI from the spec  Jakub Kicinski
Regenerate the FOU uAPI header from the YAML spec. The flags now come before attributes which use them, and the comments for type disappear (coders should look at the spec instead). Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-24  netlink: add a proto specification for FOU  Jakub Kicinski
FOU has a reasonably modern Genetlink family. Add a spec. Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-24  net: add basic C code generators for Netlink  Jakub Kicinski
Code generators to turn Netlink specs into C code. I'm definitely not proud of it. The main generator is in Python, there's a bash script to regen all code-gen'ed files in tree after making spec changes. Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-24  netlink: add schemas for YAML specs  Jakub Kicinski
Add schemas for Netlink spec files. As described in the docs we have 4 "protocols" or compatibility levels, and each one comes with its own schema, but the more general / legacy schemas are supersets of the more modern ones: genetlink is the smallest, followed by genetlink-c and genetlink-legacy. There is no schema for raw netlink yet; I haven't found the time. I don't know enough jsonschema to do inheritance or something but the repetition is not too bad. I hope. Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-24  docs: add more netlink docs (incl. spec docs)  Jakub Kicinski
Add documentation about the upcoming Netlink protocol specs. Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-24  Merge branch 'net-sched-use-the-backlog-for-nested-mirred-ingress'  Paolo Abeni
Davide Caratti says: ==================== net/sched: use the backlog for nested mirred ingress TC mirred has a protection against excessive stack growth, but that protection doesn't really guarantee the absence of recursion, nor does it guard against loops. Patch 1/2 rewords "recursion" to "nesting" to make this clearer. We can leverage this existing mechanism to prevent TCP / SCTP from soft-locking up in some specific scenarios that use mirred egress->ingress: patch 2 changes mirred so that the networking backlog is used for nested mirred ingress actions. ==================== Link: https://lore.kernel.org/r/cover.1674233458.git.dcaratti@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-24  act_mirred: use the backlog for nested calls to mirred ingress  Davide Caratti
William reports kernel soft-lockups on some OVS topologies when TC mirred egress->ingress action is hit by local TCP traffic [1]. The same can also be reproduced with SCTP (thanks Xin for verifying), when client and server reach themselves through mirred egress to ingress, and one of the two peers sends a "heartbeat" packet (from within a timer).

Enqueueing to backlog proved to fix this soft lockup; however, as Cong noticed [2], we should preserve - when possible - the current mirred behavior that counts as "overlimits" any eventual packet drop subsequent to the mirred forwarding action [3]. A compromise solution might use the backlog only when tcf_mirred_act() has a nest level greater than one: change tcf_mirred_forward() accordingly.

Also, add a kselftest that can reproduce the lockup and verifies TC mirred ability to account for further packet drops after TC mirred egress->ingress (when the nest level is 1).

[1] https://lore.kernel.org/netdev/33dc43f587ec1388ba456b4915c75f02a8aae226.1663945716.git.dcaratti@redhat.com/
[2] https://lore.kernel.org/netdev/Y0w%2FWWY60gqrtGLp@pop-os.localdomain/
[3] such behavior is not guaranteed: for example, if RPS or skb RX timestamping is enabled on the mirred target device, the kernel can defer receiving the skb and return NET_RX_SUCCESS inside tcf_mirred_forward().

Reported-by: William Zhao <wizhao@redhat.com>
CC: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
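The compromise described above (use the backlog only when the nest level is greater than one) can be modelled as a small decision function. This is a toy model, not the kernel code: the function name, the returned labels, and the nest limit value are all assumptions made for illustration.

```python
NEST_LIMIT = 4  # assumed cap on nested mirred calls; the value is illustrative


def forward_path(want_ingress, nest_level):
    """Pick how a mirred-forwarded packet would be delivered in this model.

    Returns one of the hypothetical labels:
      "drop"    - nesting exceeded the cap, packet is not forwarded
      "xmit"    - egress target, packet is transmitted
      "direct"  - ingress target at nest level 1, received synchronously
                  so later drops can still be counted as overlimits
      "backlog" - nested ingress, packet is queued to the backlog
    """
    if nest_level > NEST_LIMIT:
        return "drop"
    if not want_ingress:
        return "xmit"
    return "backlog" if nest_level > 1 else "direct"


if __name__ == "__main__":
    for level in range(1, NEST_LIMIT + 2):
        print(level, forward_path(True, level))
```

The model shows why the overlimits accounting is preserved for the common case: only nested ingress forwarding gives up the synchronous return value.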
2023-01-24net/sched: act_mirred: better wording on protection against excessive stack growthDavide Caratti
with commit e2ca070f89ec ("net: sched: protect against stack overflow in TC act_mirred"), act_mirred protected itself against excessive stack growth using per_cpu counter of nested calls to tcf_mirred_act(), and capping it to MIRRED_RECURSION_LIMIT. However, such protection does not detect recursion/loops in case the packet is enqueued to the backlog (for example, when the mirred target device has RPS or skb timestamping enabled). Change the wording from "recursion" to "nesting" to make it more clear to readers. CC: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-24Merge branch 'fix-cpts-release-action-in-am65-cpts-driver'Paolo Abeni
Siddharth Vadapalli says: ==================== Fix CPTS release action in am65-cpts driver Delete unreachable code in am65_cpsw_init_cpts() function, which was Reported-by: Leon Romanovsky <leon@kernel.org> at: https://lore.kernel.org/r/Y8aHwSnVK9+sAb24@unreal Remove the devm action associated with am65_cpts_release() and invoke the function directly on the cleanup and exit paths. v4: https://lore.kernel.org/r/20230120044201.357950-1-s-vadapalli@ti.com/ v3: https://lore.kernel.org/r/20230118095439.114222-1-s-vadapalli@ti.com/ v2: https://lore.kernel.org/r/20230116044517.310461-1-s-vadapalli@ti.com/ v1: https://lore.kernel.org/r/20230113104816.132815-1-s-vadapalli@ti.com/ ==================== Link: https://lore.kernel.org/r/20230120070731.383729-1-s-vadapalli@ti.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-24net: ethernet: ti: am65-cpsw/cpts: Fix CPTS release actionSiddharth Vadapalli
The am65_cpts_release() function is registered as a devm_action in the am65_cpts_create() function in am65-cpts driver. When the am65-cpsw driver invokes am65_cpts_create(), am65_cpts_release() is added in the set of devm actions associated with the am65-cpsw driver's device. In the event of probe failure or probe deferral, the platform_drv_probe() function invokes dev_pm_domain_detach() which powers off the CPSW and the CPSW's CPTS hardware, both of which share the same power domain. Since the am65_cpts_disable() function invoked by the am65_cpts_release() function attempts to reset the CPTS hardware by writing to its registers, the CPTS hardware is assumed to be powered on at this point. However, the hardware is powered off before the devm actions are executed. Fix this by getting rid of the devm action for am65_cpts_release() and invoking it directly on the cleanup and exit paths. Fixes: f6bd59526ca5 ("net: ethernet: ti: introduce am654 common platform time sync driver") Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com> Reviewed-by: Roger Quadros <rogerq@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-24net: ethernet: ti: am65-cpsw: Delete unreachable error handling codeSiddharth Vadapalli
The am65_cpts_create() function returns -EOPNOTSUPP only when the config "CONFIG_TI_K3_AM65_CPTS" is disabled. Also, in the am65_cpsw_init_cpts() function, am65_cpts_create() can only be invoked if the config "CONFIG_TI_K3_AM65_CPTS" is enabled. Thus, the error handling code for the case in which the return value of am65_cpts_create() is -EOPNOTSUPP, is unreachable. Hence delete it. Reported-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com> Reviewed-by: Roger Quadros <rogerq@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-23net: phy: microchip: run phy initialization during each link updateRakesh Sankaranarayanan
PHY initialization is supposed to run on every mode change. "lan87xx_config_aneg()" detects every mode change using the "phy_modify_changed()" function. Earlier code had phy_modify_changed() followed by genphy_soft_reset(). But the soft reset restores all the pre-configured register values to their default state, losing all the initialization done. For this reason genphy_soft_reset() was removed. But the PHY still needs to go through the init sequence each time the mode changes. Update lan87xx_config_aneg() to run the PHY init sequence once a successful mode update is detected. The PHY init sequence added in lan87xx_phy_init() has slave init commands that are executed every time. Update the init sequence to run slave init only if phydev is in slave mode. The test setup contains a LAN9370 EVB connected to a SAMA5D3 (running DSA), and the issue can be reproduced by connecting a link to any of the available ports after SAMA5D3 boot-up. With this issue, the port will fail to update its link state. But once the SAMA5D3 is reset with the LAN9370 link still connected, on boot-up the link state will be reported as UP. But again, after some time, if the link is moved to the DOWN state, it will not get reported. Signed-off-by: Rakesh Sankaranarayanan <rakesh.sankaranarayanan@microchip.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20230120104733.724701-1-rakesh.sankaranarayanan@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-23Merge branch 'net-dsa-microchip-add-support-for-credit-based-shaper'Jakub Kicinski
Arun Ramadoss says: ==================== net: dsa: microchip: add support for credit based shaper The LAN937x switch family, KSZ9477, KSZ9567, KSZ9563 and KSZ8563 support the credit based shaper. But there are a few differences between LAN937x and the KSZ switches: - the number of queues is 8 for LAN937x and 4 for the others. - the size of the credit increment register is 24-bit for LAN937x and 16-bit for the others. This patch series adds the credit based shaper with a common implementation for LAN937x and KSZ switches. ==================== Link: https://lore.kernel.org/r/20230120052135.32120-1-arun.ramadoss@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-23net: dsa: microchip: add support for credit based shaperArun Ramadoss
KSZ9477, KSZ9567, KSZ9563, KSZ8563 and LAN937x support the credit based shaper. To identify the chips supporting CBS, the tc_cbs_supported flag is introduced in ksz_chip_data. The KSZ series has 16-bit credit increment registers whereas LAN937x has a 24-bit register. The value to be programmed in the credit increment register is determined using the successive multiplication method to convert a decimal fraction to a hexadecimal fraction.
For example: if idleslope is 10000 and sendslope is -90000, then bandwidth is 10000 - (-90000) = 100000. The 10% bandwidth of 100Mbps means 10/100 = 0.1 (decimal). This value has to be converted to hexadecimal.
1) 0.1 * 16 = 1.6 --> fraction 0.6 Carry = 1 (MSB)
2) 0.6 * 16 = 9.6 --> fraction 0.6 Carry = 9
3) 0.6 * 16 = 9.6 --> fraction 0.6 Carry = 9
4) 0.6 * 16 = 9.6 --> fraction 0.6 Carry = 9
5) 0.6 * 16 = 9.6 --> fraction 0.6 Carry = 9
6) 0.6 * 16 = 9.6 --> fraction 0.6 Carry = 9 (LSB)
Now 0.1 (decimal) becomes 0.199999 (hex). For LAN937x, the 24-bit value 0x199999 will be programmed to the Credit Inc register. For the others, the 16-bit value 0x1999 will be programmed.
Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
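The successive multiplication method from the commit message above can be sketched in a few lines. This is an illustration in Python, not the driver's code: the function name `credit_increment` and its parameters are assumptions made for the example. It uses exact integer arithmetic (a numerator and a denominator) instead of floating point, so each hexadecimal digit is the integer "carry" of one multiplication by 16.

```python
def credit_increment(idleslope, sendslope, reg_bits):
    """Convert idleslope / (idleslope - sendslope) to a hex fraction.

    reg_bits is the register width: 24 for LAN937x, 16 for the KSZ
    chips, per the commit message. One hex digit is produced per 4 bits.
    """
    bandwidth = idleslope - sendslope
    if not 0 < idleslope < bandwidth:
        raise ValueError("fraction must be strictly between 0 and 1")

    remainder = idleslope  # numerator of the fraction still to convert
    value = 0
    for _ in range(reg_bits // 4):
        # Multiply the fraction by 16; the integer part is the next digit
        # (most significant first), the rest carries to the next step.
        digit, remainder = divmod(remainder * 16, bandwidth)
        value = (value << 4) | digit
    return value


# Values from the commit message: idleslope 10000, sendslope -90000.
print(hex(credit_increment(10000, -90000, 24)))
print(hex(credit_increment(10000, -90000, 16)))
```

With the numbers from the example, the 24-bit call yields 0x199999 and the 16-bit call yields 0x1999, matching the values stated in the commit message.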