|
Note that we can't guess the listener family anymore based on the client
target address: always use IPv6.
The fullmesh flag with endpoints from different families is also
validated here.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Usually, attributes are propagated to subflows as well.
Here, if subflows are created by means other than the MPTCP path-manager,
it is important to make sure they use IPv6 if that is what the userspace
asked for.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Currently the in-kernel PM arbitrarily enforces that a created subflow's
family must match the main MPTCP socket's, while the RFC allows mixing
IPv4 and IPv6 subflows.
This patch changes the in-kernel PM logic to create subflows matching
the currently selected source (or destination) address. IPv4 sockets
can pick only IPv4 addresses (and v4-mapped-v6 ones), while IPv6 sockets
not restricted to V6ONLY can pick either IPv4 or IPv6 addresses, as
long as the source and destination families match.
A previously introduced helper is used to ease the family matching
checks, taking care of IPv4 vs IPv4-mapped-IPv6 vs IPv6-only addresses.
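A minimal userspace sketch of that kind of matching (illustrative only,
not the kernel helper; the function name is made up):

  #include <netinet/in.h>
  #include <stdbool.h>
  #include <sys/socket.h>

  /* Treat an IPv4-mapped IPv6 address as IPv4 when comparing
   * address families, as described above. */
  static bool families_match(const struct sockaddr *a,
                             const struct sockaddr *b)
  {
          int fa = a->sa_family, fb = b->sa_family;

          if (fa == AF_INET6 &&
              IN6_IS_ADDR_V4MAPPED(&((const struct sockaddr_in6 *)a)->sin6_addr))
                  fa = AF_INET;
          if (fb == AF_INET6 &&
              IN6_IS_ADDR_V4MAPPED(&((const struct sockaddr_in6 *)b)->sin6_addr))
                  fb = AF_INET;

          return fa == fb;
  }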
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/269
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
There are multiple ICMP rate limiting mechanisms:
* Global limits: net.ipv4.icmp_msgs_burst/icmp_msgs_per_sec
* v4 per-host limits: net.ipv4.icmp_ratelimit/ratemask
* v6 per-host limits: net.ipv6.icmp_ratelimit/ratemask
However, when ICMP output is limited, there is no way to tell
which limit has been hit or even if the limits are responsible
for the lack of ICMP output.
Add counters for each of the cases above. As we are within
local_bh_disable(), use the __INC stats variant.
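A hedged sketch of how one such bump might look at the global-limit
check (the MIB field and wrapper names are assumptions, not necessarily
the merged code):

  /* Count a drop when the global ICMP rate limit rejects output;
   * the caller runs with BHs disabled, hence __ICMP_INC_STATS(). */
  static bool icmp_global_allow_counted(struct net *net)
  {
          if (icmp_global_allow())
                  return true;

          __ICMP_INC_STATS(net, ICMP_MIB_RATELIMITGLOBAL);
          return false;
  }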
Example output:
# nstat -sz "*RateLimit*"
IcmpOutRateLimitGlobal          134                0.0
IcmpOutRateLimitHost            770                0.0
Icmp6OutRateLimitHost           84                 0.0
Signed-off-by: Jamie Bainbridge <jamie.bainbridge@gmail.com>
Suggested-by: Abhishek Rawal <rawal.abhishek92@gmail.com>
Link: https://lore.kernel.org/r/273b32241e6b7fdc5c609e6f5ebc68caf3994342.1674605770.git.jamie.bainbridge@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Steen Hegelund says:
====================
Adding Sparx5 IS0 VCAP support
This provides the Ingress Stage 0 (IS0) VCAP (Versatile Content-Aware
Processor) support for the Sparx5 platform.
The IS0 VCAP (also known in the datasheet as CLM) is a classifier VCAP that
mainly extracts frame information to metadata that follows the frame in the
Sparx5 processing flow all the way to the egress port.
The IS0 VCAP has 6 lookups, and they are accessible with a TC chain id:
- chain 1000000: IS0 Lookup 0
- chain 1100000: IS0 Lookup 1
- chain 1200000: IS0 Lookup 2
- chain 1300000: IS0 Lookup 3
- chain 1400000: IS0 Lookup 4
- chain 1500000: IS0 Lookup 5
Each of these lookups has its own port keyset configuration, which decides
which keys will be used for matching on which traffic type.
The IS0 VCAP has these traffic classifications:
- IPv4 frames
- IPv6 frames
- Unicast MPLS frames (ethertype = 0x8847)
- Multicast MPLS frames (ethertype = 0x8847)
- Frame types other than MPLS, IPv4 and IPv6
The IS0 VCAP has an action that allows setting the value of a PAG (Policy
Association Group) key field in the frame metadata, and this can be used
for matching in an IS2 VCAP rule.
This allows rules in the IS0 VCAP to be linked to rules in the IS2 VCAP.
The linking is exposed by using the TC "goto chain" action with an offset
from the IS2 chain ids.
As an example, a "goto chain 8000001" will use a PAG value of 1 to chain to
a rule in IS2 Lookup 0.
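A hedged sketch of the chain numbering implied by the examples above
(formulas inferred from this cover letter, not taken from the driver):

  /* IS0 lookup n maps to chain 1000000 + n * 100000; an IS2
   * Lookup 0 rule with PAG p is reached via chain 8000000 + p. */
  #define SPARX5_IS0_CHAIN(lookup)        (1000000 + 100000 * (lookup))
  #define SPARX5_IS2_L0_CHAIN(pag)        (8000000 + (pag))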
====================
Link: https://lore.kernel.org/r/20230124104511.293938-1-steen.hegelund@microchip.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This adds support for parsing and matching on the CVLAN tags in the Sparx5
IS0 VCAP.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This allows the IS0 VCAP to have its own list of supported ethernet
protocol types, matching what is supported by the VCAP's port lookup
classification.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
With more than one possible actionset in a VCAP instance, the VCAP API
will now use the actions in a VCAP rule to select the actionset that best
fits those actions.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This allows rules to be chained between VCAP instances, e.g. from IS0
Lookup 0 to IS0 Lookup 1, or from one of the IS0 Lookups to one of the IS2
Lookups.
Chaining from an IS2 Lookup to another IS2 Lookup is not supported in the
hardware.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This enables the TC command to use the Sparx5 IS0 VCAP.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This adds the actionset type id to the rule information. This is needed as
we now have more than one actionset in a VCAP instance (IS0).
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This adds the IS0 VCAP port keyset configuration for Sparx5 and also
updates the debugfs support to show the keyset configuration.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This provides the IS0 (Ingress Stage 0) or CLM VCAP model for Sparx5.
This VCAP provides classification actions for Sparx5.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Force the internal PHY off then on when switching to the internal path.
This fixes problems where the PHY ID is not properly set.
Fixes: 7090425104db ("net: phy: add amlogic g12a mdio mux support")
Suggested-by: Qi Duan <qi.duan@amlogic.com>
Co-developed-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
Link: https://lore.kernel.org/r/20230124101157.232234-1-jbrunet@baylibre.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Jakub Sitnicki says:
====================
Add IP_LOCAL_PORT_RANGE socket option
This patch set is a follow-up to the "How to share IPv4 addresses by
partitioning the port space" talk given at LPC 2022 [1].
Please see patch #1 for the motivation & the use case description.
Patch #2 adds tests exercising the new option in various scenarios.
Documentation
-------------
Proposed update to the ip(7) man-page:
IP_LOCAL_PORT_RANGE (since Linux X.Y)
Set or get the per-socket default local port range. This
option can be used to clamp down the global local port
range, defined by the ip_local_port_range /proc interface
described below, for a given socket.
The option takes a uint32_t value with the high 16 bits
set to the upper range bound, and the low 16 bits set to
the lower range bound. Range bounds are inclusive. The
16-bit values should be in host byte order.
The lower bound has to be less than the upper bound when
both bounds are not zero. Otherwise, setting the option
fails with EINVAL.
If either bound is outside of the global local port range,
or is zero, then that bound has no effect.
To reset the setting, pass zero as both the upper and the
lower bound.
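A minimal C sketch of the described encoding (the fallback define's
value is an assumption; check linux/in.h on your system):

  #include <netinet/in.h>
  #include <stdint.h>
  #include <sys/socket.h>

  #ifndef IP_LOCAL_PORT_RANGE
  #define IP_LOCAL_PORT_RANGE 51  /* assumed uapi value */
  #endif

  /* Clamp the socket's local port range to [lo, hi]: high 16 bits
   * carry the upper bound, low 16 bits the lower, host byte order. */
  static int set_local_port_range(int fd, uint16_t lo, uint16_t hi)
  {
          uint32_t range = ((uint32_t)hi << 16) | lo;

          return setsockopt(fd, SOL_IP, IP_LOCAL_PORT_RANGE,
                            &range, sizeof(range));
  }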
Interaction with SELinux bind() hook
------------------------------------
SELinux bind() hook - selinux_socket_bind() - performs a permission check
if the requested local port number lies outside of the netns ephemeral port
range.
The proposed socket option cannot be used to change the ephemeral port
range so that it extends beyond the per-netns port range, as set by
net.ipv4.ip_local_port_range.
Hence, there is no interaction with SELinux, AFAICT.
RFC -> v1
RFC: https://lore.kernel.org/netdev/20220912225308.93659-1-jakub@cloudflare.com/
* Allow either the high bound or the low bound, or both, to be zero
* Add getsockopt support
* Add selftests
Links:
------
[1]: https://lpc.events/event/16/contributions/1349/
====================
Link: https://lore.kernel.org/r/20221221-sockopt-port-range-v6-0-be255cc0e51f@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Exercise IP_LOCAL_PORT_RANGE socket option in various scenarios:
1. pass invalid values to setsockopt
2. pass a range outside of the per-netns port range
3. configure a single-port range
4. exhaust a configured multi-port range
5. check interaction with late-bind (IP_BIND_ADDRESS_NO_PORT)
6. set then get the per-socket port range
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Users who want to share a single public IP address for outgoing connections
between several hosts traditionally reach for SNAT. However, SNAT requires
state keeping on the node(s) performing the NAT.
A stateless alternative exists, where a single IP address used for egress
can be shared between several hosts by partitioning the available ephemeral
port range. In such a setup:
1. Each host gets assigned a disjoint range of ephemeral ports.
2. Applications open connections from the host-assigned port range.
3. Return traffic gets routed to the host based on both the destination IP
and the destination port.
An application which wants to open an outgoing connection (connect) from a
given port range today can choose between two solutions:
1. Manually pick the source port by bind()'ing to it before connect()'ing
the socket.
This approach has a couple of downsides:
a) The search for a free port has to be implemented in user space. If
the chosen 4-tuple happens to be busy, the application needs to retry
from a different local port number.
Detecting if a 4-tuple is busy can be either easy (TCP) or hard
(UDP). In the TCP case, the application simply has to check if connect()
returned an error (EADDRNOTAVAIL). That is assuming that local
port sharing was enabled (REUSEADDR) by all the sockets.
# Assume desired local port range is 60_000-60_511
s = socket(AF_INET, SOCK_STREAM)
s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
s.bind(("192.0.2.1", 60_000))
s.connect(("1.1.1.1", 53))
# Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy
# Application must retry with another local port
In case of UDP, the network stack allows binding more than one socket
to the same 4-tuple, when local port sharing is enabled
(REUSEADDR). Hence detecting the conflict is much harder and involves
querying sock_diag and toggling the REUSEADDR flag [1].
b) For TCP, bind()-ing to a port within the ephemeral port range means
that no connecting sockets, that is those which leave it to the
network stack to find a free local port at connect() time, can use
this port.
IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port
will be skipped during the free port search at connect() time.
2. Isolate the app in a dedicated netns and use the per-netns
ip_local_port_range sysctl to adjust the ephemeral port range bounds.
The per-netns setting affects all sockets, so this approach can be used
only if:
- there is just one egress IP address, or
- the desired egress port range is the same for all egress IP addresses
used by the application.
For TCP, this approach avoids the downsides of (1). Free port search and
4-tuple conflict detection is done by the network stack:
system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'")
s = socket(AF_INET, SOCK_STREAM)
s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1)
s.bind(("192.0.2.1", 0))
s.connect(("1.1.1.1", 53))
# Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy
For UDP this approach has limited applicability. Setting the
IP_BIND_ADDRESS_NO_PORT socket option does not result in the local source
port being shared with other connected UDP sockets.
Hence, relying on the network stack to find a free source port limits the
number of outgoing UDP flows from a single IP address down to the number
of available ephemeral ports.
To put it another way, partitioning the ephemeral port range between hosts
using the existing Linux networking API is cumbersome.
To address this use case, add a new socket option at the SOL_IP level,
named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the
ephemeral port range for each socket individually.
The option can be used only to narrow down the per-netns local port
range. If the per-socket range lies outside of the per-netns range, the
latter takes precedence.
UAPI-wise, the low and high range bounds are passed to the kernel as a pair
of u16 values in host byte order packed into a u32. This avoids pointer
passing.
PORT_LO = 40_000
PORT_HI = 40_511
s = socket(AF_INET, SOCK_STREAM)
v = struct.pack("I", PORT_HI << 16 | PORT_LO)
s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v)
s.bind(("127.0.0.1", 0))
s.getsockname()
# Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511),
# if there is a free port. EADDRINUSE otherwise.
[1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116
Reviewed-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Current documentation URL [1] is no longer valid.
[1] https://www.linuxfoundation.org/collaborate/workgroups/networking/bridge
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Link: https://lore.kernel.org/r/20230124145127.189221-1-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
netif_stop_queue() and netif_wake_queue() act on TX queue 0. This is ok
as long as only a single TX queue is supported. But support for multiple
TX queues was introduced with 762031375d5c and I missed adapting the
stopping and waking of TX queues.
Use netif_stop_subqueue() and netif_tx_wake_queue() to act on a specific
TX queue.
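For reference, a hedged sketch of the per-queue variants named above
(the wrapper names are placeholders, not driver code):

  /* Stop and wake a specific TX queue instead of queue 0. */
  static void my_tx_stop(struct net_device *ndev, u16 qidx)
  {
          netif_stop_subqueue(ndev, qidx);
  }

  static void my_tx_wake(struct net_device *ndev, u16 qidx)
  {
          netif_tx_wake_queue(netdev_get_tx_queue(ndev, qidx));
  }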
Fixes: 762031375d5c ("tsnep: Support multiple TX/RX queue pairs")
Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Link: https://lore.kernel.org/r/20230124191440.56887-1-gerhard@engleder-embedded.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix spelling in net/ Kconfig files.
(reported by codespell)
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Florian Westphal <fw@strlen.de>
Cc: coreteam@netfilter.org
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Link: https://lore.kernel.org/r/20230124181724.18166-1-rdunlap@infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
During EEH error injection testing, a deadlock was encountered in the tg3
driver when tg3_io_error_detected() was attempting to cancel outstanding
reset tasks:
crash> foreach UN bt
...
PID: 159 TASK: c0000000067c6000 CPU: 8 COMMAND: "eehd"
...
#5 [c00000000681f990] __cancel_work_timer at c00000000019fd18
#6 [c00000000681fa30] tg3_io_error_detected at c00800000295f098 [tg3]
#7 [c00000000681faf0] eeh_report_error at c00000000004e25c
...
PID: 290 TASK: c000000036e5f800 CPU: 6 COMMAND: "kworker/6:1"
...
#4 [c00000003721fbc0] rtnl_lock at c000000000c940d8
#5 [c00000003721fbe0] tg3_reset_task at c008000002969358 [tg3]
#6 [c00000003721fc60] process_one_work at c00000000019e5c4
...
PID: 296 TASK: c000000037a65800 CPU: 21 COMMAND: "kworker/21:1"
...
#4 [c000000037247bc0] rtnl_lock at c000000000c940d8
#5 [c000000037247be0] tg3_reset_task at c008000002969358 [tg3]
#6 [c000000037247c60] process_one_work at c00000000019e5c4
...
PID: 655 TASK: c000000036f49000 CPU: 16 COMMAND: "kworker/16:2"
...
#4 [c0000000373ebbc0] rtnl_lock at c000000000c940d8
#5 [c0000000373ebbe0] tg3_reset_task at c008000002969358 [tg3]
#6 [c0000000373ebc60] process_one_work at c00000000019e5c4
...
Code inspection shows that both tg3_io_error_detected() and
tg3_reset_task() attempt to acquire the RTNL lock at the beginning of
their code blocks. If tg3_reset_task() should happen to execute between
the times when tg3_io_error_detected() acquires the RTNL lock and
tg3_reset_task_cancel() is called, a deadlock will occur.
Moving the tg3_reset_task_cancel() call earlier within the code block, prior
to acquiring RTNL, prevents this from happening, but also exposes another
deadlock issue where tg3_reset_task() may execute AFTER
tg3_io_error_detected() has executed:
crash> foreach UN bt
PID: 159 TASK: c0000000067d2000 CPU: 9 COMMAND: "eehd"
...
#4 [c000000006867a60] rtnl_lock at c000000000c940d8
#5 [c000000006867a80] tg3_io_slot_reset at c0080000026c2ea8 [tg3]
#6 [c000000006867b00] eeh_report_reset at c00000000004de88
...
PID: 363 TASK: c000000037564000 CPU: 6 COMMAND: "kworker/6:1"
...
#3 [c000000036c1bb70] msleep at c000000000259e6c
#4 [c000000036c1bba0] napi_disable at c000000000c6b848
#5 [c000000036c1bbe0] tg3_reset_task at c0080000026d942c [tg3]
#6 [c000000036c1bc60] process_one_work at c00000000019e5c4
...
This issue can be avoided by aborting tg3_reset_task() if EEH error
recovery is already in progress.
Fixes: db84bf43ef23 ("tg3: tg3_reset_task() needs to use rtnl_lock to synchronize")
Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Link: https://lore.kernel.org/r/20230124185339.225806-1-drc@linux.vnet.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Kui-Feng Lee says:
====================
This patchset implements a change to bpf_setsockopt() which allows
ktls-enabled sockets to be used with the SOL_TCP level. This is
necessary because enabling ktls changes the socket's setsockopt
function pointer, which bpf_setsockopt() checks in order to make sure
that the socket is a TCP socket. Checking sk_protocol instead of the
function pointer ensures that bpf_setsockopt() with the SOL_TCP level
still works on sockets with ktls enabled.
The major differences from v2 are:
- Add a read() call to make sure that the FIN has arrived.
- Remove the dependency on other test's header.
The major differences from v1 are:
- Test with an IPv6 connect as well.
- Use ASSERT_OK()
v2: https://lore.kernel.org/bpf/20230124181220.2871611-1-kuifeng@meta.com/
v1: https://lore.kernel.org/bpf/20230121025716.3039933-1-kuifeng@meta.com/
====================
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
Ensure that whenever bpf_setsockopt() is called with the SOL_TCP
option on a ktls-enabled socket, the call will be accepted by the
system. The provided test makes sure of this by performing an
examination when the server-side socket is in the CLOSE_WAIT state. At
this stage, ktls is still enabled on the server socket and can be used
to verify that bpf_setsockopt() works correctly.
Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
Link: https://lore.kernel.org/r/20230125201608.908230-3-kuifeng@meta.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
Resolve an issue when calling sol_tcp_sockopt() on a socket with ktls
enabled. Prior to this patch, sol_tcp_sockopt() would only allow calls
if the socket's setsockopt function pointer was set to
tcp_setsockopt(). However, any socket with ktls enabled has that
function pointer set to tls_setsockopt(). To resolve this issue, the
patch checks the protocol of the underlying socket instead, and allows
bpf_setsockopt() to be called even if ktls is initialized on the
socket. This ensures that calls to sol_tcp_sockopt() will succeed on
sockets with ktls enabled.
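A hedged sketch of such a protocol-based check (not necessarily the
exact patch; the helper name is made up):

  /* Accept SOL_TCP operations based on the socket's protocol rather
   * than its setsockopt function pointer, which ktls replaces with
   * tls_setsockopt(). */
  static bool sk_is_tcp_sock(const struct sock *sk)
  {
          return sk->sk_type == SOCK_STREAM &&
                 sk->sk_protocol == IPPROTO_TCP;
  }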
Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
Link: https://lore.kernel.org/r/20230125201608.908230-2-kuifeng@meta.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
David Vernet says:
====================
This is part 4 of https://lore.kernel.org/bpf/20230123232228.646563-1-void@manifault.com/
Part 3: https://lore.kernel.org/all/20230125050359.339273-1-void@manifault.com/
Part 2: https://lore.kernel.org/all/20230124160802.1122124-1-void@manifault.com/
Changelog:
----------
v3 -> v4:
- Fix accidental typo in name of dummy_st_ops introduced in v2, moving
it back to dummy_st_ops from dummy_st_ops_success. Should fix s390x
testruns.
v2 -> v3:
- Don't call a KF_SLEEPABLE kfunc from the dummy_st_ops testsuite, and
remove the newly added bpf_kfunc_call_test_sleepable() test kfunc
(Martin).
- Include vmlinux.h from progs/dummy_st_ops_success.c (previously
progs/dummy_st_ops.c) rather than manually defining
struct bpf_dummy_ops_state and struct bpf_dummy_ops.
(Martin).
- Fix a typo added to prog_tests/dummy_st_ops.c in a previous version:
s/trace_dummy_st_ops_success__open/trace_dummy_st_ops__open.
v1 -> v2:
- Add support for specifying sleepable struct_ops programs with
struct_ops.s in libbpf (Alexei).
- Move failure test case into new dummy_st_ops_fail.c prog file.
- Update test_dummy_sleepable() to use struct_ops.s instead of manually
setting prog flags. Also remove open_load_skel() helper which is no
longer needed.
- Fix verifier tests to expect new sleepable prog failure message.
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
In a set of prior changes, we added the ability for struct_ops programs
to be sleepable. This patch enhances the dummy_st_ops selftest suite to
validate this behavior by adding a new sleepable struct_ops entry to
dummy_st_ops.
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230125164735.785732-5-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The .check_member field of struct bpf_struct_ops is currently passed the
member's btf_type via const struct btf_type *t, and a const struct
btf_member *member. This allows the struct_ops implementation to check
whether e.g. an ops is supported, but it would be useful to also enforce
that the struct_ops prog being loaded for that member has other
qualities, like being sleepable (or not). This patch therefore updates
the .check_member() callback to also take a const struct bpf_prog *prog
argument.
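A hedged sketch of what an implementation might do with the new
argument (the sleepable-flag access is an assumption of this sketch):

  /* Reject sleepable programs for a member that must not sleep. */
  static int my_ops_check_member(const struct btf_type *t,
                                 const struct btf_member *member,
                                 const struct bpf_prog *prog)
  {
          if (prog->aux->sleepable)
                  return -EINVAL;

          return 0;
  }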
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230125164735.785732-4-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
In a prior change, the verifier was updated to support sleepable
BPF_PROG_TYPE_STRUCT_OPS programs. A caller could set the program as
sleepable with bpf_program__set_flags(), but it would be more ergonomic
and more in-line with other sleepable program types if we supported
suffixing a struct_ops section name with .s to indicate that it's
sleepable.
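A hedged example of what that usage could look like (member name and
ops type borrowed from the dummy_st_ops selftests):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  char _license[] SEC("license") = "GPL";

  /* The ".s" suffix marks the struct_ops program as sleepable. */
  SEC("struct_ops.s/test_sleepable")
  int BPF_PROG(test_sleepable, struct bpf_dummy_ops_state *state)
  {
          return 0;
  }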
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230125164735.785732-3-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
BPF struct_ops programs currently cannot be marked as sleepable. This
need not be the case -- struct_ops programs can be sleepable, and e.g.
invoke kfuncs that export the KF_SLEEPABLE flag. So as to allow future
struct_ops programs to invoke such kfuncs, this patch updates the
verifier to allow struct_ops programs to be sleepable. A follow-on patch
will add support to libbpf for specifying struct_ops.s as a sleepable
struct_ops program, and then another patch will add testcases to the
dummy_st_ops selftest suite which test sleepable struct_ops behavior.
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230125164735.785732-2-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
As stated in README.rst, 'LDLIBS=-static' should be used to resolve
linker errors. Most problems will be solved by this option, but in the
case of urandom_read, this won't fix the problem. So the Makefile is
currently implemented to strip the 'static' option when compiling
urandom_read. However, this stripping isn't applied to $(LDLIBS)
correctly, which is now causing errors on static compilation.
# LDLIBS=-static ./vmtest.sh
ld.lld: error: attempted static link of dynamic object liburandom_read.so
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [Makefile:190: /linux/tools/testing/selftests/bpf/urandom_read] Error 1
make: *** Waiting for unfinished jobs....
This commit fixes this problem by applying the strip to $(LDLIBS) as well.
Fixes: 68084a136420 ("selftests/bpf: Fix building bpf selftests statically")
Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230125100440.21734-1-danieltimlee@gmail.com
|
|
HOSTCC is always wanted when building. Setting CC to HOSTCC happens
after tools/scripts/Makefile.include is included, meaning flags are
set assuming, say, CC is gcc, but then CC can later be set to HOSTCC,
which may be clang. tools/scripts/Makefile.include is needed for host
setup and common macros in objtool's Makefile. Rather than override
CC to HOSTCC, just pass CC as HOSTCC to Makefile.build, the libsubcmd
builds and the linkage step. This means the Makefiles don't see things
like CC changing, and tool flag determination and similar work
properly.
Also, clear the passed subdir as otherwise an outer build may break by
inadvertently passing an inappropriate value.
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20230124064324.672022-2-irogers@google.com
|
|
Previously tools/lib/subcmd was added to the include path, switch to
installing the headers and then including from that directory. This
avoids dependencies on headers internal to tools/lib/subcmd. Add the
missing subcmd directory to the affected #include.
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20230124064324.672022-1-irogers@google.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping
Pull fuse ACL fix from Christian Brauner:
"The new posix acl API doesn't depend on the xattr handler
infrastructure anymore and instead only relies on the posix acl inode
operations. As a result daemons without FUSE_POSIX_ACL are unable to
use posix acls like they used to.
Fix this by copying what we did for overlayfs during the posix acl api
conversion. Make fuse implement a dedicated ->get_inode_acl() method
as does overlayfs. Fuse can then also uses this to express different
needs for vfs permission checking during lookup and acl based
retrieval via the regular system call path.
This allows fuse to continue to refuse retrieving posix acls for
daemons that don't set FUSE_POSXI_ACL for permission checking while
also allowing a fuse server to retrieve it via the usual system calls"
* tag 'fs.fuse.acl.v6.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
fuse: fixes after adapting to new posix acl api
|
|
Since the latest Intel hardware does both IWARP and ROCE, rename the
term IWARP in the virtchnl header to be RDMA. Do this for both upper and
lower case instances. Many of the non-virtchnl.h changes were done with
regular expression replacements using perl like:
perl -p -i -e 's/_IWARP/_RDMA/' <files>
perl -p -i -e 's/_iwarp/_rdma/' <files>
and I had to pick up a few instances manually.
The virtchnl.h header has some comments and clarity added around when to
use certain defines.
note: had to fix a checkpatch warning for a long line by wrapping one of
the lines I changed.
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Jakub Andrysiak <jakub.andrysiak@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
The virtchnl interface has a bunch of "softly" defined structures that
can be hardened by using explicit sizes in declarations, and then
referring to the enum type that uses them in a comment. None of these
changes should change any of the structure sizes.
Also, remove a duplicate line in a switch statement and let two cases
use the same code.
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Marek Szlosek <marek.szlosek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
We already have the SPDX header, so just leave a copyright notice with
an updated year and get rid of the boilerplate header (so 2002!).
In addition, update a couple of comments to clarify how the various
parts of the virtchnl header interact.
No functional changes.
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Marek Szlosek <marek.szlosek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Nothing uses virtchnl_msg, just remove it.
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Marek Szlosek <marek.szlosek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
David Vernet says:
====================
This is part 3 of https://lore.kernel.org/all/20230119235833.2948341-1-void@manifault.com/
Part 2: https://lore.kernel.org/bpf/20230120192523.3650503-1-void@manifault.com/
This series is based off of commit b613d335a743 ("bpf: Allow trusted
args to walk struct when checking BTF IDs").
Changelog:
----------
v2 -> v3:
- Rebase onto master (commit described above). Only conflict that
required resolution was updating the task_kfunc selftest suite error
message location.
- Put copyright onto one line in kernel/bpf/cpumask.c.
- Remove now-unneeded pid-checking logic from
progs/nested_trust_success.c.
- Fix a couple of small grammatical typos in documentation.
v1 -> v2:
- Put back 'static' keyword in bpf_find_btf_id()
(kernel test robot <lkp@intel.com>)
- Surround cpumask kfuncs in __diag() blocks to avoid no-prototype build
warnings (kernel test robot <lkp@intel.com>)
- Enable ___init suffixes to a type definition to signal that a type is
a nocast alias of another type. That is, that when passed to a kfunc
that expects one of the two types, the verifier will reject the other
even if they're equivalent according to the C standard (Kumar and
Alexei)
- Reject NULL for all trusted args, not just PTR_TO_MEM (Kumar)
- Reject both NULL and PTR_MAYBE_NULL for all trusted args (Kumar and
Alexei)
- Improve examples given in cpumask documentation (Alexei)
- Use __success macro for nested_trust test (Alexei)
- Fix comment typo in struct bpf_cpumask comment header.
- Fix another example in the bpf_cpumask doc examples.
- Add documentation for ___init suffix change mentioned above.
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
When comparing BTF IDs for pointers being passed to kfunc arguments, the
verifier will allow pointer types that are equivalent according to the C
standard. For example, for:
struct bpf_cpumask {
        cpumask_t cpumask;
        refcount_t usage;
};
The verifier will allow a struct bpf_cpumask * to be passed to a kfunc
that takes a const struct cpumask * (cpumask_t is a typedef of struct
cpumask). The exception to this rule is if a type is suffixed with
___init, such as:
struct nf_conn___init {
        struct nf_conn ct;
};
The verifier will _not_ allow a struct nf_conn___init * to be passed to
a kfunc that expects a struct nf_conn *. This patch documents this
behavior in the kfuncs documentation page.
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230125143816.721952-8-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
A prior change defined a new BTF_TYPE_SAFE_NESTED macro in the verifier
which allows developers to specify when a pointee field in a struct type
should inherit its parent pointer's trusted status. This patch updates
the kfuncs documentation to specify this macro and how it can be used.
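A hedged illustration of the macro's shape as described above (the
member chosen here is just an example):

  /* Pointees listed here inherit the parent pointer's trusted
   * status when walked by the verifier. */
  BTF_TYPE_SAFE_NESTED(struct task_struct) {
          struct task_struct *group_leader;
  };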
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230125143816.721952-7-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Now that we've added a series of new cpumask kfuncs, we should document
them so users can easily use them. This patch adds a new cpumasks.rst
file to document them.
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230125143816.721952-6-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
A recent patch added a new set of kfuncs for allocating, freeing,
manipulating, and querying cpumasks. This patch adds a new 'cpumask'
selftest suite which verifies their behavior.
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230125143816.721952-5-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Now that defining trusted fields in a struct is supported, we should add
selftests to verify the behavior. This patch adds a few such testcases.
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230125143816.721952-4-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Certain programs may wish to be able to query cpumasks. For example, if
a program that is tracing percpu operations wishes to track which tasks
end up running on which CPUs, it could be useful to associate that with
the tasks' cpumasks. Similarly, programs tracking NUMA allocations, CPU
scheduling domains, etc, could potentially benefit from being able to
see which CPUs a task could be migrated to.
This patch enables these types of use cases by introducing a series of
bpf_cpumask_* kfuncs. Amongst these kfuncs, there are two separate
"classes" of operations:
1. kfuncs which allow the caller to allocate and mutate their own
cpumask kptrs in the form of a struct bpf_cpumask * object. Such
kfuncs include e.g. bpf_cpumask_create() to allocate the cpumask, and
bpf_cpumask_or() to mutate it. "Regular" cpumasks such as p->cpus_ptr
may not be passed to these kfuncs, and the verifier will ensure this
is the case by comparing BTF IDs.
2. Read-only operations which operate on const struct cpumask *
arguments. For example, bpf_cpumask_test_cpu(), which tests whether a
CPU is set in the cpumask. Any trusted struct cpumask * or struct
bpf_cpumask * may be passed to these kfuncs. The verifier allows
struct bpf_cpumask * even though the kfunc is defined with struct
cpumask * because the first element of a struct bpf_cpumask is a
cpumask_t, so it is safe to cast.
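A hedged usage sketch combining both classes from a BPF program (the
set/release kfuncs are assumptions; only create, or, and test_cpu are
named above, and kfunc declarations are omitted for brevity):

  SEC("tp_btf/task_newtask")
  int BPF_PROG(cpumask_demo, struct task_struct *task, u64 clone_flags)
  {
          struct bpf_cpumask *mask;

          mask = bpf_cpumask_create();            /* class 1: allocate */
          if (!mask)
                  return 0;

          bpf_cpumask_set_cpu(0, mask);           /* assumed mutator */

          /* class 2: read-only query; struct bpf_cpumask * is accepted
           * where const struct cpumask * is expected. */
          bpf_cpumask_test_cpu(0, (const struct cpumask *)mask);

          bpf_cpumask_release(mask);              /* assumed release */
          return 0;
  }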
A follow-on patch will add selftests which validate these kfuncs, and
another will document them.
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230125143816.721952-3-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
KF_TRUSTED_ARGS kfuncs currently have a subtle and insidious bug in
validating pointers to scalars. Say that you have a kfunc like the
following, which takes an array as the first argument:
bool bpf_cpumask_empty(const struct cpumask *cpumask)
{
        return cpumask_empty(cpumask);
}
...
BTF_ID_FLAGS(func, bpf_cpumask_empty, KF_TRUSTED_ARGS)
...
If a BPF program were to invoke the kfunc with a NULL argument, it would
crash the kernel. The reason is that struct cpumask is defined as a
bitmap, which is itself defined as an array, and is accessed as a memory
address by bitmap operations. So when the verifier analyzes the
register, it interprets it as a pointer to a scalar struct, which is an
array of size 8. check_mem_reg() then sees that the register is NULL and
returns 0, and the kfunc crashes when it passes it down to the cpumask
wrappers.
To fix this, this patch adds a check for KF_ARG_PTR_TO_MEM which
verifies that the register doesn't contain a possibly-NULL pointer if
the kfunc is KF_TRUSTED_ARGS.
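A hedged sketch of the added check (helper names are assumptions about
the verifier internals):

  /* For KF_TRUSTED_ARGS kfuncs, a KF_ARG_PTR_TO_MEM argument must
   * not be a NULL or maybe-NULL register. */
  static int check_trusted_mem_arg(struct bpf_verifier_env *env,
                                   const struct bpf_reg_state *reg,
                                   int argno)
  {
          if (register_is_null(reg) || type_may_be_null(reg->type)) {
                  verbose(env, "NULL pointer passed to trusted arg%d\n",
                          argno);
                  return -EACCES;
          }

          return 0;
  }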
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230125143816.721952-2-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Jeremy Kerr says:
====================
net: mctp: struct sock lifetime fixes
This series is a set of fixes for the sock lifetime handling in the
AF_MCTP code, fixing a uaf reported by Noam Rathaus
<noamr@ssd-disclosure.com>.
The Fixes: tags indicate the original patches affected, but some
tweaking to backport to those commits may be needed; I have a separate
branch with backports to 5.15 if that helps with stable trees.
Of course, any comments/queries most welcome.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Once a socket has been unhashed, we want to prevent it from being
re-used in a sk_key entry as part of a routing operation.
This change marks the sk as SOCK_DEAD on unhash, which prevents addition
into the net's key list.
We need to do this during the key add path, rather than key lookup, as
we release the net keys_lock between those operations.
Fixes: 4a992bbd3650 ("mctp: Implement message fragmentation & reassembly")
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently, we have a race where we look up a sock through a "general"
(ie, not directly associated with the (src,dest,tag) tuple) key, then
drop the key reference while still holding the key's sock.
This change expands the key reference until we've finished using the
sock, and hence the sock reference too.
Commit message changes from Jeremy Kerr <jk@codeconstruct.com.au>.
Reported-by: Noam Rathaus <noamr@ssd-disclosure.com>
Fixes: 73c618456dc5 ("mctp: locking, lifetime and validity changes for sk_keys")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently, we delete the key expiry timer (in sk->close) before
unhashing the sk. This means that another thread may find the sk through
its presence on the key list, and re-queue the timer.
This change moves the timer deletion to the unhash, after we have made
the key no longer observable, so the timer cannot be re-queued.
Fixes: 7b14e15ae6f4 ("mctp: Implement a timeout for tags")
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently, we correlate the mctp_sk_key lifetime to the sock lifetime
through the sock hash/unhash operations, but this is pretty tenuous, and
there are cases where we may have a temporary reference to an unhashed
sk.
This change makes the reference more explicit, by adding a hold on the
sock when it's associated with a mctp_sk_key, released on final key
unref.
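A hedged sketch of the reference pattern described (names and fields
simplified; not the actual net/mctp code):

  /* Key creation holds the sock; the final key unref drops it. */
  static struct mctp_sk_key *key_alloc(struct sock *sk)
  {
          struct mctp_sk_key *key = kzalloc(sizeof(*key), GFP_KERNEL);

          if (!key)
                  return NULL;

          sock_hold(sk);
          key->sk = sk;
          refcount_set(&key->refs, 1);
          return key;
  }

  static void key_unref(struct mctp_sk_key *key)
  {
          if (refcount_dec_and_test(&key->refs)) {
                  sock_put(key->sk);
                  kfree(key);
          }
  }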
Fixes: 73c618456dc5 ("mctp: locking, lifetime and validity changes for sk_keys")
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
|