Age | Commit message (Collapse) | Author |
|
When SHAMPO is used, a receive queue currently almost always leaks one
page on shutdown.
A page has MLX5E_SHAMPO_WQ_HEADER_PER_PAGE (8) headers. These headers
are tracked in the SHAMPO bitmap. Each page is released when the last
header index in the group is processed. During header allocation, there
can be leftovers from a page that will be used in a subsequent
allocation. This is normally fine, except for the following scenario
(simplified a bit):
1) Allocate N new page fragments, showing only the relevant last 4
fragments:
0: new page
1: new page
2: new page
3: new page
4: page from previous allocation
5: page from previous allocation
6: page from previous allocation
7: page from previous allocation
2) NAPI processes header indices 4-7 because they are the oldest
allocated. Bit 7 will be set to 0.
3) Receive queue shutdown occurs. All the remaining bits are being
iterated on to release the pages. But the page assigned to header
indices 0-3 will not be freed due to what happened in step 2.
This patch fixes the issue by making sure that on allocation, header
fragments are always allocated in groups of
MLX5E_SHAMPO_WQ_HEADER_PER_PAGE so that there is never a partial page
left over between allocations.
A more appropriate fix would be a refactoring of
mlx5e_alloc_rx_hd_mpwqe() and mlx5e_build_shampo_hd_umr(). But this
refactoring is too big for net. It will be targeted for net-next.
Fixes: e839ac9a89cb ("net/mlx5e: SHAMPO, Simplify header page release in teardown")
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20240815071611.2211873-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Vladimir Oltean says:
====================
VLAN fixes for Ocelot driver
This is a collection of patches I've gathered over the past several
months.
Patches 1-6/14 are supporting patches for selftests.
Patch 9/14 fixes PTP TX from a VLAN upper of a VLAN-aware bridge port
when using the "ocelot-8021q" tagging protocol. Patch 7/14 is its
supporting selftest.
Patch 10/14 fixes the QoS class used by PTP in the same case as above.
It is hard to quantify - there is no selftest.
Patch 11/14 fixes potential data corruption during PTP TX in the same
case as above. Again, there is no selftest.
Patch 13/14 fixes RX in the same case as above - 8021q upper of a
VLAN-aware bridge port, with the "ocelot-8021q" tagging protocol. Patch
12/14 is a supporting patch for this in the DSA core, and 7/14 is also
its selftest.
Patch 14/14 ensures that VLAN-aware bridges offloaded to Ocelot only
react to the ETH_P_8021Q TPID, and treat absolutely everything else as
VLAN-untagged, including ETH_P_8021AD. Patch 8/14 is the supporting
selftest.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
I was revisiting the topic of 802.1ad treatment in the Ocelot switch [0]
and realized that not only is its basic VLAN classification pipeline
improper for offloading vlan_protocol 802.1ad bridges, but also improper
for offloading regular 802.1Q bridges already.
Namely, 802.1ad-tagged traffic should be treated as VLAN-untagged by
bridged ports, but this switch treats it as if it was 802.1Q-tagged with
the same VID as in the 802.1ad header. This is markedly different to
what the Linux bridge expects; see the "other_tpid()" function in
tools/testing/selftests/net/forwarding/bridge_vlan_aware.sh.
An idea came to me that the VCAP IS1 TCAM is more powerful than I'm
giving it credit for, and that it actually overwrites the classified VID
before the VLAN Table lookup takes place. In other words, it can be
used even to save a packet from being dropped on ingress due to VLAN
membership.
Add a sophisticated TCAM rule hardcoded into the driver to force the
switch to behave like a Linux bridge with vlan_filtering 1 vlan_protocol
802.1Q.
Regarding the lifetime of the filter: eventually the bridge will
disappear, and vlan_filtering on the port will be restored to 0 for
standalone mode. Then the filter will be deleted.
[0]: https://lore.kernel.org/netdev/20201009122947.nvhye4hvcha3tljh@skbuf/
Fixes: 7142529f1688 ("net: mscc: ocelot: add VLAN filtering")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
There is a major design bug with ocelot-8021q, which is that it expects
more of the hardware than the hardware can actually do. The short
summary of the issue is that when a port is under a VLAN-aware bridge
and we use this tagging protocol, VLAN upper interfaces of this port do
not see RX traffic.
We use VCAP ES0 (egress rewriter) rules towards the tag_8021q CPU port
to encapsulate packets with an outer tag, later stripped by software,
that depends on the source user port. We do this so that packets can be
identified in ocelot_rcv(). To be precise, we create rules with
push_outer_tag = OCELOT_ES0_TAG and push_inner_tag = 0.
With this configuration, we expect the switch to keep the inner tag
configuration as found in the packet (if it was untagged on user port
ingress, keep it untagged, otherwise preserve the VLAN tag unmodified
as the inner tag towards the tag_8021q CPU port). But this is not what
happens.
Instead, table "Tagging Combinations" from the user manual suggests
that when the ES0 action is "PUSH_OUTER_TAG=1 and PUSH_INNER_TAG=0",
there will be "no inner tag". Experimentation further clarifies what
this means.
It appears that this "inner tag" which is not pushed into the packet on
its egress towards the CPU is none other than the classified VLAN.
When the ingress user port is standalone or under a VLAN-unaware bridge,
the classified VLAN is a discardable quantity: it is a fixed value - the
result of ocelot_vlan_unaware_pvid()'s configuration, and actually
independent of the VID from any 802.1Q header that may be in the frame.
It is actually preferable to discard the "inner tag" in this case.
The problem is when the ingress port is under a VLAN-aware bridge.
Then, the classified VLAN is taken from the frame's 802.1Q header, with
a fallback on the bridge port's PVID. It would be very good to not
discard the "inner tag" here, because if we do, we break communication
with any 8021q VLAN uppers that the port might have. These have a
processing path outside the bridge.
There seems to be nothing else we can do except to change the
configuration for VCAP ES0 rules, to actually push the inner VLAN into
the frame. There are 2 options for that, first is to push a fixed value
specified in the rule, and second is to push a fixed value, plus
(aka arithmetic +) the classified VLAN. We choose the second option,
and we select that fixed value as 0. Thus, what is pushed in the inner
tag is just the classified VLAN.
From there, we need to perform software untagging, in the receive path,
of stuff that was untagged on the wire.
Fixes: 7c83a7c539ab ("net: dsa: add a second tagger for Ocelot switches based on tag_8021q")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Through code analysis, I realized that the ds->untag_bridge_pvid logic
is contradictory - see the newly added FIXME above the kernel-doc for
dsa_software_untag_vlan_unaware_bridge().
Moreover, for the Felix driver, I need something very similar, but which
is actually _not_ contradictory: untag the bridge PVID on RX, but for
VLAN-aware bridges. The existing logic does it for VLAN-unaware bridges.
Since I don't want to change the functionality of drivers which were
supposedly properly tested with the ds->untag_bridge_pvid flag, I have
introduced a new one: ds->untag_vlan_aware_bridge_pvid, and I have
refactored the DSA reception code into a common path for both flags.
TODO: both flags should be unified under a single ds->software_vlan_untag,
which users of both current flags should set. This is not something that
can be carried out right away. It needs very careful examination of all
drivers which make use of this functionality, since some of them
actually get this wrong in the first place.
For example, commit 9130c2d30c17 ("net: dsa: microchip: ksz8795: Use
software untagging on CPU port") uses this in a driver which has
ds->configure_vlan_while_not_filtering = true. The latter mechanism has
been known for many years to be broken by design:
https://lore.kernel.org/netdev/CABumfLzJmXDN_W-8Z=p9KyKUVi_HhS7o_poBkeKHS2BkAiyYpw@mail.gmail.com/
and we have the situation of 2 bugs canceling each other. There is no
private VLAN, and the port follows the PVID of the VLAN-unaware bridge.
So, it's kinda ok for that driver to use the ds->untag_bridge_pvid
mechanism, in a broken way.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
As explained by Horatiu Vultur in commit 603ead96582d ("net: sparx5: Add
spinlock for frame transmission from CPU") which is for a similar
hardware design, multiple CPUs can simultaneously perform injection
or extraction. There are only 2 register groups for injection and 2
for extraction, and the driver only uses one of each. So we'd better
serialize access using spin locks, otherwise frame corruption is
possible.
Note that unlike in sparx5, FDMA in ocelot does not have this issue
because struct ocelot_fdma_tx_ring already contains an xmit_lock.
I guess this is mostly a problem for NXP LS1028A, as that is dual core.
I don't think VSC7514 is. So I'm blaming the commit where LS1028A (aka
the felix DSA driver) started using register-based packet injection and
extraction.
Fixes: 0a6f17c6ae21 ("net: dsa: tag_ocelot_8021q: add support for PTP timestamping")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
There are 2 distinct code paths (listed below) in the source code which
set up an injection header for Ocelot(-like) switches. Code path (2)
lacks the QoS class and source port being set correctly. Especially the
improper QoS classification is a problem for the "ocelot-8021q"
alternative DSA tagging protocol, because we support tc-taprio and each
packet needs to be scheduled precisely through its time slot. This
includes PTP, which is normally assigned to a traffic class other than
0, but would be sent through TC 0 nonetheless.
The code paths are:
(1) ocelot_xmit_common() from net/dsa/tag_ocelot.c - called only by the
standard "ocelot" DSA tagging protocol which uses NPI-based
injection - sets up bit fields in the tag manually to account for
a small difference (destination port offset) between Ocelot and
Seville. Namely, ocelot_ifh_set_dest() is omitted out of
ocelot_xmit_common(), because there's also seville_ifh_set_dest().
(2) ocelot_ifh_set_basic(), called by:
- ocelot_fdma_prepare_skb() for FDMA transmission of the ocelot
switchdev driver
- ocelot_port_xmit() -> ocelot_port_inject_frame() for
register-based transmission of the ocelot switchdev driver
- felix_port_deferred_xmit() -> ocelot_port_inject_frame() for the
DSA tagger ocelot-8021q when it must transmit PTP frames (also
through register-based injection).
sets the bit fields according to its own logic.
The problem is that (2) doesn't call ocelot_ifh_set_qos_class().
Copying that logic from ocelot_xmit_common() fixes that.
Unfortunately, although desirable, it is not easily possible to
de-duplicate code paths (1) and (2), and make net/dsa/tag_ocelot.c
directly call ocelot_ifh_set_basic()), because of the ocelot/seville
difference. This is the "minimal" fix with some logic duplicated (but
at least more consolidated).
Fixes: 0a6f17c6ae21 ("net: dsa: tag_ocelot_8021q: add support for PTP timestamping")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
register injection
Problem description
-------------------
On an NXP LS1028A (felix DSA driver) with the following configuration:
- ocelot-8021q tagging protocol
- VLAN-aware bridge (with STP) spanning at least swp0 and swp1
- 8021q VLAN upper interfaces on swp0 and swp1: swp0.700, swp1.700
- ptp4l on swp0.700 and swp1.700
we see that the ptp4l instances do not see each other's traffic,
and they all go to the grand master state due to the
ANNOUNCE_RECEIPT_TIMEOUT_EXPIRES condition.
Jumping to the conclusion for the impatient
-------------------------------------------
There is a zero-day bug in the ocelot switchdev driver in the way it
handles VLAN-tagged packet injection. The correct logic already exists in
the source code, in function ocelot_xmit_get_vlan_info() added by commit
5ca721c54d86 ("net: dsa: tag_ocelot: set the classified VLAN during xmit").
But it is used only for normal NPI-based injection with the DSA "ocelot"
tagging protocol. The other injection code paths (register-based and
FDMA-based) roll their own wrong logic. This affects and was noticed on
the DSA "ocelot-8021q" protocol because it uses register-based injection.
By moving ocelot_xmit_get_vlan_info() to a place that's common for both
the DSA tagger and the ocelot switch library, it can also be called from
ocelot_port_inject_frame() in ocelot.c.
We need to touch the lines with ocelot_ifh_port_set()'s prototype
anyway, so let's rename it to something clearer regarding what it does,
and add a kernel-doc. ocelot_ifh_set_basic() should do.
Investigation notes
-------------------
Debugging reveals that PTP event (aka those carrying timestamps, like
Sync) frames injected into swp0.700 (but also swp1.700) hit the wire
with two VLAN tags:
00000000: 01 1b 19 00 00 00 00 01 02 03 04 05 81 00 02 bc
~~~~~~~~~~~
00000010: 81 00 02 bc 88 f7 00 12 00 2c 00 00 02 00 00 00
~~~~~~~~~~~
00000020: 00 00 00 00 00 00 00 00 00 00 00 01 02 ff fe 03
00000030: 04 05 00 01 00 04 00 00 00 00 00 00 00 00 00 00
00000040: 00 00
The second (unexpected) VLAN tag makes felix_check_xtr_pkt() ->
ptp_classify_raw() fail to see these as PTP packets at the link
partner's receiving end, and return PTP_CLASS_NONE (because the BPF
classifier is not written to expect 2 VLAN tags).
The reason why packets have 2 VLAN tags is because the transmission
code treats VLAN incorrectly.
Neither ocelot switchdev, nor felix DSA, declare the NETIF_F_HW_VLAN_CTAG_TX
feature. Therefore, at xmit time, all VLANs should be in the skb head,
and none should be in the hwaccel area. This is done by:
static struct sk_buff *validate_xmit_vlan(struct sk_buff *skb,
netdev_features_t features)
{
if (skb_vlan_tag_present(skb) &&
!vlan_hw_offload_capable(features, skb->vlan_proto))
skb = __vlan_hwaccel_push_inside(skb);
return skb;
}
But ocelot_port_inject_frame() handles things incorrectly:
ocelot_ifh_port_set(ifh, port, rew_op, skb_vlan_tag_get(skb));
void ocelot_ifh_port_set(struct sk_buff *skb, void *ifh, int port, u32 rew_op)
{
(...)
if (vlan_tag)
ocelot_ifh_set_vlan_tci(ifh, vlan_tag);
(...)
}
The way __vlan_hwaccel_push_inside() pushes the tag inside the skb head
is by calling:
static inline void __vlan_hwaccel_clear_tag(struct sk_buff *skb)
{
skb->vlan_present = 0;
}
which does _not_ zero out skb->vlan_tci as seen by skb_vlan_tag_get().
This means that ocelot, when it calls skb_vlan_tag_get(), sees
(and uses) a residual skb->vlan_tci, while the same VLAN tag is
_already_ in the skb head.
The trivial fix for double VLAN headers is to replace the content of
ocelot_ifh_port_set() with:
if (skb_vlan_tag_present(skb))
ocelot_ifh_set_vlan_tci(ifh, skb_vlan_tag_get(skb));
but this would not be correct either, because, as mentioned,
vlan_hw_offload_capable() is false for us, so we'd be inserting dead
code and we'd always transmit packets with VID=0 in the injection frame
header.
I can't actually test the ocelot switchdev driver and rely exclusively
on code inspection, but I don't think traffic from 8021q uppers has ever
been injected properly, and not double-tagged. Thus I'm blaming the
introduction of VLAN fields in the injection header - early driver code.
As hinted at in the early conclusion, what we _want_ to happen for
VLAN transmission was already described once in commit 5ca721c54d86
("net: dsa: tag_ocelot: set the classified VLAN during xmit").
ocelot_xmit_get_vlan_info() intends to ensure that if the port through
which we're transmitting is under a VLAN-aware bridge, the outer VLAN
tag from the skb head is stripped from there and inserted into the
injection frame header (so that the packet is processed in hardware
through that actual VLAN). And in all other cases, the packet is sent
with VID=0 in the injection frame header, since the port is VLAN-unaware
and has logic to strip this VID on egress (making it invisible to the
wire).
Fixes: 08d02364b12f ("net: mscc: fix the injection header")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The bridge VLAN implementation w.r.t. VLAN protocol is described in
merge commit 1a0b20b25732 ("Merge branch 'bridge-next'"). We are only
sensitive to those VLAN tags whose TPID is equal to the bridge's
vlan_protocol. Thus, an 802.1ad VLAN should be treated as 802.1Q-untagged.
Add 3 tests which validate that:
- 802.1ad-tagged traffic is learned into the PVID of an 802.1Q-aware
bridge
- Double-tagged traffic is forwarded when just the PVID of the port is
present in the VLAN group of the ports
- Double-tagged traffic is not forwarded when the PVID of the port is
absent from the VLAN group of the ports
The test passes with both veth and ocelot.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
A breakage in the felix DSA driver shows we do not have enough test
coverage. More generally, it is sufficiently special that it is likely
drivers will treat it differently.
This is not meant to be a full PTP test, it just makes sure that PTP
packets sent to the different addresses corresponding to their profiles
are received correctly. The local_termination selftest seemed like the
most appropriate place for this addition.
PTP RX/TX in some cases makes no sense (over a bridge) and this is why
$skip_ptp exists. And in others - PTP over a bridge port - the IP stack
needs convincing through the available bridge netfilter hooks to leave
the PTP packets alone and not stolen by the bridge rx_handler. It is
safe to assume that users have that figured out already. This is a
driver level test, and by using tcpdump, all that extra setup is out of
scope here.
send_non_ip() was an unfinished idea; written but never used.
Replace it with a more generic send_raw(), and send 3 PTP packet types
times 3 transports.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
xfail_on_veth() for this test is an incorrect approximation which gives
false positives and false negatives.
When local_termination fails with "reception succeeded, but should have failed",
it is because the DUT ($h2) accepts packets even when not configured as
promiscuous. This is not something specific to veth; even the bridge
behaves that way, but this is not captured by the xfail_on_veth test.
The IFF_UNICAST_FLT flag is not explicitly exported to user space, but
it can somewhat be determined from the interface's behavior. We have to
create a macvlan upper with a different MAC address. This forces a
dev_uc_add() call in the kernel. When the unicast filtering list is
not empty, but the device doesn't support IFF_UNICAST_FLT,
__dev_set_rx_mode() force-enables promiscuity on the interface, to
ensure correct behavior (that the requested address is received).
We can monitor the change in the promiscuity flag and infer from it
whether the device supports unicast filtering.
There is no equivalent thing for allmulti, unfortunately. We never know
what's hiding behind a device which has allmulti=off. Whether it will
actually perform RX multicast filtering of unknown traffic is a strong
"maybe". The bridge driver, for example, completely ignores the flag.
We'll have to keep the xfail behavior, but instead of XFAIL on just
veth, always XFAIL.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
behavior
Add more coverage to the local termination selftest as follows:
- 8021q upper of $h2
- 8021q upper of $h2, where $h2 is a port of a VLAN-unaware bridge
- 8021q upper of $h2, where $h2 is a port of a VLAN-aware bridge
- 8021q upper of VLAN-unaware br0, which is the upper of $h2
- 8021q upper of VLAN-aware br0, which is the upper of $h2
Especially the cases with traffic sent through the VLAN upper of a
VLAN-aware bridge port will be immediately relevant when we will start
transmitting PTP packets as an additional kind of traffic.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The current bridge() test is for packet reception on a VLAN-unaware
bridge. Some things are different enough with VLAN-aware bridges that
it's worth renaming this test into vlan_unaware_bridge(), and add a new
vlan_aware_bridge() test.
The two will share the same implementation: bridge() becomes a common
function, which receives $vlan_filtering as an argument. Rename it to
test_bridge() at the same time, because just bridge() pollutes the
global namespace and we cannot invoke the binary with the same name from
the iproute2 package currently.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
There are upcoming tests which verify the RX filtering of a bridge
(or bridge port), but under differing vlan_filtering conditions.
Since we currently print $h2 (the DUT) in the log_test() output, it
becomes necessary to make a further distinction between tests, to not
give the user the impression that the exact same thing is run twice.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In future changes we will want to subject the DUT, $h2, to additional
VLAN-tagged traffic. For that, we need to run the tests using $h1.100 as
a sending interface, rather than the currently hardcoded $h1.
Add a parameter to run_test() and modify its 2 callers to explicitly
pass $h1, as was implicit before.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This will be used in other subtests as well; make new macvlan_create()
and macvlan_destroy() functions.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
tools/testing/selftests/drivers/net/ is not listed under
networking entries. Add it to NETWORKING DRIVERS.
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240814142832.3473685-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The driver currently blindly deletes its cache of RSS cotexts and
ntuple filters when the ethtool channel count is changing. It also
deletes the ntuple filters cache when the default indirection table
is changing.
The core will not allow ethtool channels to drop below any that
have been configured as ntuple destinations since this commit from 2022:
47f3ecf4763d ("ethtool: Fail number of channels change when it conflicts with rxnfc")
So there is absolutely no need to delete the ntuple filters and
RSS contexts when changing ethtool channels.
It is also unnecessary to delete ntuple filters when the default
RSS indirection table is changing.
Remove bnxt_clear_usr_fltrs() and bnxt_clear_rss_ctxis() from the
ethtool ops and change them to static functions.
This bug will cause confusion to the end user and causes failure when
running the rss_ctx.py selftest.
Fixes: 1018319f949c ("bnxt_en: Invalidate user filters when needed")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Closes: https://lore.kernel.org/netdev/20240725111912.7bc17cf6@kernel.org/
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20240814225429.199280-1-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
During suspend/resume the following BUG was hit:
------------[ cut here ]------------
kernel BUG at lib/dynamic_queue_limits.c:99!
Internal error: Oops - BUG: 0 [#1] SMP ARM
Modules linked in: bluetooth ecdh_generic ecc libaes
CPU: 1 PID: 1282 Comm: rtcwake Not tainted
6.10.0-rc3-00732-gc8bd1f7f3e61 #15240
Hardware name: Generic DT based system
PC is at dql_completed+0x270/0x2cc
LR is at __free_old_xmit+0x120/0x198
pc : [<c07ffa54>] lr : [<c0c42bf4>] psr: 80000013
...
Flags: Nzcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
Control: 10c5387d Table: 43a4406a DAC: 00000051
...
Process rtcwake (pid: 1282, stack limit = 0xfbc21278)
Stack: (0xe0805e80 to 0xe0806000)
...
Call trace:
dql_completed from __free_old_xmit+0x120/0x198
__free_old_xmit from free_old_xmit+0x44/0xe4
free_old_xmit from virtnet_poll_tx+0x88/0x1b4
virtnet_poll_tx from __napi_poll+0x2c/0x1d4
__napi_poll from net_rx_action+0x140/0x2b4
net_rx_action from handle_softirqs+0x11c/0x350
handle_softirqs from call_with_stack+0x18/0x20
call_with_stack from do_softirq+0x48/0x50
do_softirq from __local_bh_enable_ip+0xa0/0xa4
__local_bh_enable_ip from virtnet_open+0xd4/0x21c
virtnet_open from virtnet_restore+0x94/0x120
virtnet_restore from virtio_device_restore+0x110/0x1f4
virtio_device_restore from dpm_run_callback+0x3c/0x100
dpm_run_callback from device_resume+0x12c/0x2a8
device_resume from dpm_resume+0x12c/0x1e0
dpm_resume from dpm_resume_end+0xc/0x18
dpm_resume_end from suspend_devices_and_enter+0x1f0/0x72c
suspend_devices_and_enter from pm_suspend+0x270/0x2a0
pm_suspend from state_store+0x68/0xc8
state_store from kernfs_fop_write_iter+0x10c/0x1cc
kernfs_fop_write_iter from vfs_write+0x2b0/0x3dc
vfs_write from ksys_write+0x5c/0xd4
ksys_write from ret_fast_syscall+0x0/0x54
Exception stack(0xe8bf1fa8 to 0xe8bf1ff0)
...
---[ end trace 0000000000000000 ]---
After virtnet_napi_enable() is called, the following path is hit:
__napi_poll()
-> virtnet_poll()
-> virtnet_poll_cleantx()
-> netif_tx_wake_queue()
That wakes the TX queue and allows skbs to be submitted and accounted by
BQL counters.
Then netdev_tx_reset_queue() is called that resets BQL counters and
eventually leads to the BUG in dql_completed().
Move virtnet_napi_tx_enable() what does BQL counters reset before RX
napi enable to avoid the issue.
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Closes: https://lore.kernel.org/netdev/e632e378-d019-4de7-8f13-07c572ab37a9@samsung.com/
Fixes: c8bd1f7f3e61 ("virtio_net: add support for Byte Queue Limits")
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20240814122500.1710279-1-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from wireless and netfilter
Current release - regressions:
- udp: fall back to software USO if IPv6 extension headers are
present
- wifi: iwlwifi: correctly lookup DMA address in SG table
Current release - new code bugs:
- eth: mlx5e: fix queue stats access to non-existing channels splat
Previous releases - regressions:
- eth: mlx5e: take state lock during tx timeout reporter
- eth: mlxbf_gige: disable RX filters until RX path initialized
- eth: igc: fix reset adapter logics when tx mode change
Previous releases - always broken:
- tcp: update window clamping condition
- netfilter:
- nf_queue: drop packets with cloned unconfirmed conntracks
- nf_tables: Add locking for NFT_MSG_GETOBJ_RESET requests
- vsock: fix recursive ->recvmsg calls
- dsa: vsc73xx: fix MDIO bus access and PHY opera
- eth: gtp: pull network headers in gtp_dev_xmit()
- eth: igc: fix packet still tx after gate close by reducing i226 MAC
retry buffer
- eth: mana: fix RX buf alloc_size alignment and atomic op panic
- eth: hns3: fix a deadlock problem when config TC during resetting"
* tag 'net-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (58 commits)
net: hns3: use correct release function during uninitialization
net: hns3: void array out of bound when loop tnl_num
net: hns3: fix a deadlock problem when config TC during resetting
net: hns3: use the user's cfg after reset
net: hns3: fix wrong use of semaphore up
selftests: net: lib: kill PIDs before del netns
pse-core: Conditionally set current limit during PI regulator registration
net: thunder_bgx: Fix netdev structure allocation
net: ethtool: Allow write mechanism of LPL and both LPL and EPL
vsock: fix recursive ->recvmsg calls
selftest: af_unix: Fix kselftest compilation warnings
netfilter: nf_tables: Add locking for NFT_MSG_GETOBJ_RESET requests
netfilter: nf_tables: Introduce nf_tables_getobj_single
netfilter: nf_tables: Audit log dump reset after the fact
selftests: netfilter: add test for br_netfilter+conntrack+queue combination
netfilter: nf_queue: drop packets with cloned unconfirmed conntracks
netfilter: flowtable: initialise extack before use
netfilter: nfnetlink: Initialise extack before use in ACKs
netfilter: allow ipv6 fragments to arrive on different devices
tcp: Update window clamping condition
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:
"Two regression fixes:
- fix atomisp support for ISP2400
- fix dvb-usb regression for TeVii s480 dual DVB-S2 S660 board"
* tag 'media/v6.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
media: atomisp: Fix streaming no longer working on BYT / ISP2400 devices
media: Revert "media: dvb-usb: Fix unexpected infinite loop in dvb_usb_read_remote_control()"
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux
Pull ata fix from Niklas Cassel:
- Revert a recent change to sense data generation.
Sense data can be in either fixed format or descriptor format.
The D_SENSE bit in the Control mode page controls which format to
generate. All places but one respected the D_SENSE bit.
The recent change fixed the one place that didn't respect the D_SENSE
bit. However, it turns out that hdparm, hddtemp and udisks
(incorrectly) assumes sense data in descriptor format.
Therefore, even while the change was technically correct, revert it,
since even if these user space programs are fixed to (correctly) look
at the format type before parsing the data, older versions of these
tools will be around roughly forever.
* tag 'ata-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
Revert "ata: libata-scsi: Honor the D_SENSE bit for CK_COND=1 and no error"
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Ignores ifindex for types other than mcast/linklocal in ipv6 frag
reasm, from Tom Hughes.
2) Initialize extack for begin/end netlink message marker in batch,
from Donald Hunter.
3) Initialize extack for flowtable offload support, also from Donald.
4) Dropped packets with cloned unconfirmed conntracks in nfqueue,
later it should be possible to explore lookup after reinject but
Florian prefers this approach at this stage. From Florian Westphal.
5) Add selftest for cloned unconfirmed conntracks in nfqueue for
previous update.
6) Audit after filling netlink header successfully in object dump,
from Phil Sutter.
7-8) Fix concurrent dump and reset which could result in underflow
counter / quota objects.
netfilter pull request 24-08-15
* tag 'nf-24-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: Add locking for NFT_MSG_GETOBJ_RESET requests
netfilter: nf_tables: Introduce nf_tables_getobj_single
netfilter: nf_tables: Audit log dump reset after the fact
selftests: netfilter: add test for br_netfilter+conntrack+queue combination
netfilter: nf_queue: drop packets with cloned unconfirmed conntracks
netfilter: flowtable: initialise extack before use
netfilter: nfnetlink: Initialise extack before use in ACKs
netfilter: allow ipv6 fragments to arrive on different devices
====================
Link: https://patch.msgid.link/20240814222042.150590-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Jijie Shao says:
====================
There are some bugfix for the HNS3 ethernet driver
====================
Link: https://patch.msgid.link/20240813141024.1707252-1-shaojijie@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
pci_request_regions is called to apply for PCI I/O and memory resources
when the driver is initialized, Therefore, when the driver is uninstalled,
pci_release_regions should be used to release PCI I/O and memory resources
instead of pci_release_mem_regions is used to release memory reasouces
only.
Signed-off-by: Peiyang Wang <wangpeiyang1@huawei.com>
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
When query reg inf of SSU, it loops tnl_num times. However, tnl_num comes
from hardware and the length of array is a fixed value. To void array out
of bound, make sure the loop time is not greater than the length of array
Signed-off-by: Peiyang Wang <wangpeiyang1@huawei.com>
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
When config TC during the reset process, may cause a deadlock, the flow is
as below:
pf reset start
│
▼
......
setup tc │
│ ▼
▼ DOWN: napi_disable()
napi_disable()(skip) │
│ │
▼ ▼
...... ......
│ │
▼ │
napi_enable() │
▼
UINIT: netif_napi_del()
│
▼
......
│
▼
INIT: netif_napi_add()
│
▼
...... global reset start
│ │
▼ ▼
UP: napi_enable()(skip) ......
│ │
▼ ▼
...... napi_disable()
In reset process, the driver will DOWN the port and then UINIT, in this
case, the setup tc process will UP the port before UINIT, so cause the
problem. Adds a DOWN process in UINIT to fix it.
Fixes: bb6b94a896d4 ("net: hns3: Add reset interface implementation in client")
Signed-off-by: Jie Wang <wangjie125@huawei.com>
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Consider the followed case that the user change speed and reset the net
interface. Before the hw change speed successfully, the driver get old
old speed from hw by timer task. After reset, the previous speed is config
to hw. As a result, the new speed is configed successfully but lost after
PF reset. The followed pictured shows more dirrectly.
+------+ +----+ +----+
| USER | | PF | | HW |
+---+--+ +-+--+ +-+--+
| ethtool -s 100G | |
+------------------>| set speed 100G |
| +--------------------->|
| | set successfully |
| |<---------------------+---+
| |query cfg (timer task)| |
| +--------------------->| | handle speed
| | return 200G | | changing event
| ethtool --reset |<---------------------+ | (100G)
+------------------>| cfg previous speed |<--+
| | after reset (200G) |
| +--------------------->|
| | +---+
| |query cfg (timer task)| |
| +--------------------->| | handle speed
| | return 100G | | changing event
| |<---------------------+ | (200G)
| | |<--+
| |query cfg (timer task)|
| +--------------------->|
| | return 200G |
| |<---------------------+
| | |
v v v
This patch save new speed if hw change speed successfully, which will be
used after reset successfully.
Fixes: 2d03eacc0b7e ("net: hns3: Only update mac configuation when necessary")
Signed-off-by: Peiyang Wang <wangpeiyang1@huawei.com>
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Currently, if hns3 PF or VF FLR reset failed after five times retry,
the reset done process will directly release the semaphore
which has already released in hclge_reset_prepare_general.
This will cause down operation fail.
So this patch fixes it by adding reset state judgement. The up operation is
only called after successful PF FLR reset.
Fixes: 8627bdedc435 ("net: hns3: refactor the precedure of PF FLR")
Fixes: f28368bb4542 ("net: hns3: refactor the procedure of VF FLR")
Signed-off-by: Jie Wang <wangjie125@huawei.com>
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
When deleting netns, it is possible to still have some tasks running,
e.g. background tasks like tcpdump running in the background, not
stopped because the test has been interrupted.
Before deleting the netns, it is then safer to kill all attached PIDs,
if any. That should reduce some noises after the end of some tests, and
help with the debugging of some issues. That's why this modification is
seen as a "fix".
Fixes: 25ae948b4478 ("selftests/net: add lib.sh")
Acked-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Acked-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20240813-upstream-net-20240813-selftests-net-lib-kill-v1-1-27b689b248b8@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Fix an issue where `devm_regulator_register()` would fail for PSE
controllers that do not support current limit control, such as simple
GPIO-based controllers like the podl-pse-regulator. The
`REGULATOR_CHANGE_CURRENT` flag and `max_uA` constraint are now
conditionally set only if the `pi_set_current_limit` operation is
supported. This change prevents the regulator registration routine from
attempting to call `pse_pi_set_current_limit()`, which would return
`-EOPNOTSUPP` and cause the registration to fail.
Fixes: 4a83abcef5f4f ("net: pse-pd: Add new power limit get and set c33 features")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Tested-by: Kyle Swenson <kyle.swenson@est.tech>
Link: https://patch.msgid.link/20240813073719.2304633-1-o.rempel@pengutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Commit 94833addfaba ("net: thunderx: Unembed netdev structure") had
a go at dynamically allocating the netdev structures for the thunderx_bgx
driver. This change results in my ThunderX box catching fire (to be fair,
it is what it does best).
The issues with this change are that:
- bgx_lmac_enable() is called *after* bgx_acpi_register_phy() and
bgx_init_of_phy(), both expecting netdev to be a valid pointer.
- bgx_init_of_phy() populates the MAC addresses for *all* LMACs
attached to a given BGX instance, and thus needs netdev for each of
them to have been allocated.
There is a few things to be said about how the driver mixes LMAC and
BGX states which leads to this sorry state, but that's beside the point.
To address this, go back to a situation where all netdev structures
are allocated before the driver starts relying on them, and move the
freeing of these structures to driver removal. Someone brave enough
can always go and restructure the driver if they want.
Fixes: 94833addfaba ("net: thunderx: Unembed netdev structure")
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: Breno Leitao <leitao@debian.org>
Cc: Sunil Goutham <sgoutham@marvell.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20240812141322.1742918-1-maz@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
CMIS 5.2 standard section 9.4.2 defines four types of firmware update
supported mechanism: None, only LPL, only EPL, both LPL and EPL.
Currently, only LPL (Local Payload) type of write firmware block is
supported. However, if the module supports both LPL and EPL the flashing
process wrongly fails for no supporting LPL.
Fix that, by allowing the write mechanism to be LPL or both LPL and
EPL.
Fixes: c4f78134d45c ("ethtool: cmis_fw_update: add a layer for supporting firmware update using CDB")
Reported-by: Vladyslav Mykhaliuk <vmykhaliuk@nvidia.com>
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20240812140824.3718826-1-danieller@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
After a vsock socket has been added to a BPF sockmap, its prot->recvmsg
has been replaced with vsock_bpf_recvmsg(). Thus the following
recursiion could happen:
vsock_bpf_recvmsg()
-> __vsock_recvmsg()
-> vsock_connectible_recvmsg()
-> prot->recvmsg()
-> vsock_bpf_recvmsg() again
We need to fix it by calling the original ->recvmsg() without any BPF
sockmap logic in __vsock_recvmsg().
Fixes: 634f1a7110b4 ("vsock: support sockmap")
Reported-by: syzbot+bdb4bd87b5e22058e2a4@syzkaller.appspotmail.com
Tested-by: syzbot+bdb4bd87b5e22058e2a4@syzkaller.appspotmail.com
Cc: Bobby Eshleman <bobby.eshleman@bytedance.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20240812022153.86512-1-xiyou.wangcong@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Kalle Valo says:
====================
wireless fixes for v6.11
We have few fixes to drivers. The most important here is a fix for
iwlwifi which caused major slowdowns for several users.
* tag 'wireless-2024-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: iwlwifi: correctly lookup DMA address in SG table
wifi: mt76: mt7921: fix NULL pointer access in mt7921_ipv6_addr_change
wifi: brcmfmac: cfg80211: Handle SSID based pmksa deletion
wifi: rtlwifi: rtl8192du: Initialise value32 in _rtl92du_init_queue_reserved_page
wifi: ath12k: use 128 bytes aligned iova in transmit path for WCN7850
====================
Link: https://patch.msgid.link/20240814171606.E14A0C116B1@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Change expected_buf from (const void *) to (const char *)
in function __recvpair().
This change fixes the below warnings during test compilation:
```
In file included from msg_oob.c:14:
msg_oob.c: In function ‘__recvpair’:
../../kselftest_harness.h:106:40: warning: format ‘%s’ expects argument
of type ‘char *’,but argument 6 has type ‘const void *’ [-Wformat=]
../../kselftest_harness.h:101:17: note: in expansion of macro ‘__TH_LOG’
msg_oob.c:235:17: note: in expansion of macro ‘TH_LOG’
../../kselftest_harness.h:106:40: warning: format ‘%s’ expects argument
of type ‘char *’,but argument 6 has type ‘const void *’ [-Wformat=]
../../kselftest_harness.h:101:17: note: in expansion of macro ‘__TH_LOG’
msg_oob.c:259:25: note: in expansion of macro ‘TH_LOG’
```
Fixes: d098d77232c3 ("selftest: af_unix: Add msg_oob.c.")
Signed-off-by: Abhinav Jain <jain.abhinav177@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20240814080743.1156166-1-jain.abhinav177@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- extend tree-checker verification of directory item type
- fix regression in page/folio and extent state tracking in xarray, the
dirty status can get out of sync and can cause problems e.g. a hang
- in send, detect last extent and allow to clone it instead of sending
it as write, reduces amount of data transferred in the stream
- fix checking extent references when cleaning deleted subvolumes
- fix one more case in the extent map shrinker, let it run only in the
kswapd context so it does not cause latency spikes during other
operations
* tag 'for-6.11-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix invalid mapping of extent xarray state
btrfs: send: allow cloning non-aligned extent if it ends at i_size
btrfs: only run the extent map shrinker from kswapd tasks
btrfs: tree-checker: reject BTRFS_FT_UNKNOWN dir type
btrfs: check delayed refs when we're checking if a ref exists
|
|
Objects' dump callbacks are not concurrency-safe per-se with reset bit
set. If two CPUs perform a reset at the same time, at least counter and
quota objects suffer from value underrun.
Prevent this by introducing dedicated locking callbacks for nfnetlink
and the asynchronous dump handling to serialize access.
Fixes: 43da04a593d8 ("netfilter: nf_tables: atomic dump and reset for stateful objects")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Outsource the reply skb preparation for non-dump getrule requests into a
distinct function. Prep work for object reset locking.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
In theory, dumpreset may fail and invalidate the preceeding log message.
Fix this and use the occasion to prepare for object reset locking, which
benefits from a few unrelated changes:
* Add an early call to nfnetlink_unicast if not resetting which
effectively skips the audit logging but also unindents it.
* Extract the table's name from the netlink attribute (which is verified
via earlier table lookup) to not rely upon validity of the looked up
table pointer.
* Do not use local variable family, it will vanish.
Fixes: 8e6cf365e1d5 ("audit: log nftables configuration change events")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Trigger cloned skbs leaving softirq protection.
This triggers splat without the preceeding change
("netfilter: nf_queue: drop packets with cloned unconfirmed
conntracks"):
WARNING: at net/netfilter/nf_conntrack_core.c:1198 __nf_conntrack_confirm..
because local delivery and forwarding will race for confirmation.
Based on a reproducer script from Yi Chen.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Conntrack assumes an unconfirmed entry (not yet committed to global hash
table) has a refcount of 1 and is not visible to other cores.
With multicast forwarding this assumption breaks down because such
skbs get cloned after being picked up, i.e. ct->use refcount is > 1.
Likewise, bridge netfilter will clone broad/mutlicast frames and
all frames in case they need to be flood-forwarded during learning
phase.
For ip multicast forwarding or plain bridge flood-forward this will
"work" because packets don't leave softirq and are implicitly
serialized.
With nfqueue this no longer holds true, the packets get queued
and can be reinjected in arbitrary ways.
Disable this feature, I see no other solution.
After this patch, nfqueue cannot queue packets except the last
multicast/broadcast packet.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Fix missing initialisation of extack in flow offload.
Fixes: c29f74e0df7a ("netfilter: nf_flow_table: hardware offload support")
Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Add missing extack initialisation when ACKing BATCH_BEGIN and BATCH_END.
Fixes: bf2ac490d28c ("netfilter: nfnetlink: Handle ACK flags for batch messages")
Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Pull kvm fixes from Paolo Bonzini:
"s390:
- Fix failure to start guests with kvm.use_gisa=0
- Panic if (un)share fails to maintain security.
ARM:
- Use kvfree() for the kvmalloc'd nested MMUs array
- Set of fixes to address warnings in W=1 builds
- Make KVM depend on assembler support for ARMv8.4
- Fix for vgic-debug interface for VMs without LPIs
- Actually check ID_AA64MMFR3_EL1.S1PIE in get-reg-list selftest
- Minor code / comment cleanups for configuring PAuth traps
- Take kvm->arch.config_lock to prevent destruction / initialization
race for a vCPU's CPUIF which may lead to a UAF
x86:
- Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)
- Fix smatch issues
- Small cleanups
- Make x2APIC ID 100% readonly
- Fix typo in uapi constant
Generic:
- Use synchronize_srcu_expedited() on irqfd shutdown"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (21 commits)
KVM: SEV: uapi: fix typo in SEV_RET_INVALID_CONFIG
KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)
KVM: eventfd: Use synchronize_srcu_expedited() on shutdown
KVM: selftests: Add a testcase to verify x2APIC is fully readonly
KVM: x86: Make x2APIC ID 100% readonly
KVM: x86: Use this_cpu_ptr() instead of per_cpu_ptr(smp_processor_id())
KVM: x86: hyper-v: Remove unused inline function kvm_hv_free_pa_page()
KVM: SVM: Fix an error code in sev_gmem_post_populate()
KVM: SVM: Fix uninitialized variable bug
KVM: arm64: vgic: Hold config_lock while tearing down a CPU interface
KVM: selftests: arm64: Correct feature test for S1PIE in get-reg-list
KVM: arm64: Tidying up PAuth code in KVM
KVM: arm64: vgic-debug: Exit the iterator properly w/o LPI
KVM: arm64: Enforce dependency on an ARMv8.4-aware toolchain
s390/uv: Panic for set and remove shared access UVC errors
KVM: s390: fix validity interception issue when gisa is switched off
docs: KVM: Fix register ID of SPSR_FIQ
KVM: arm64: vgic: fix unexpected unlock sparse warnings
KVM: arm64: fix kdoc warnings in W=1 builds
KVM: arm64: fix override-init warnings in W=1 builds
...
|
|
Commit 264640fc2c5f4 ("ipv6: distinguish frag queues by device
for multicast and link-local packets") modified the ipv6 fragment
reassembly logic to distinguish frag queues by device for multicast
and link-local packets but in fact only the main reassembly code
limits the use of the device to those address types and the netfilter
reassembly code uses the device for all packets.
This means that if fragments of a packet arrive on different interfaces
then netfilter will fail to reassemble them and the fragments will be
expired without going any further through the filters.
Fixes: 648700f76b03 ("inet: frags: use rhashtables for reassembly units")
Signed-off-by: Tom Hughes <tom@compton.nu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
"INVALID" is misspelt in "SEV_RET_INAVLID_CONFIG". Since this is part of
the UAPI, keep the current definition and add a new one with the fix.
Fix-suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Amit Shah <amit.shah@amd.com>
Message-ID: <20240814083113.21622-1-amit@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Disallow read-only memslots for SEV-{ES,SNP} VM types, as KVM can't
directly emulate instructions for ES/SNP, and instead the guest must
explicitly request emulation. Unless the guest explicitly requests
emulation without accessing memory, ES/SNP relies on KVM creating an MMIO
SPTE, with the subsequent #NPF being reflected into the guest as a #VC.
But for read-only memslots, KVM deliberately doesn't create MMIO SPTEs,
because except for ES/SNP, doing so requires setting reserved bits in the
SPTE, i.e. the SPTE can't be readable while also generating a #VC on
writes. Because KVM never creates MMIO SPTEs and jumps directly to
emulation, the guest never gets a #VC. And since KVM simply resumes the
guest if ES/SNP guests trigger emulation, KVM effectively puts the vCPU
into an infinite #NPF loop if the vCPU attempts to write read-only memory.
Disallow read-only memory for all VMs with protected state, i.e. for
upcoming TDX VMs as well as ES/SNP VMs. For TDX, it's actually possible
to support read-only memory, as TDX uses EPT Violation #VE to reflect the
fault into the guest, e.g. KVM could configure read-only SPTEs with RX
protections and SUPPRESS_VE=0. But there is no strong use case for
supporting read-only memslots on TDX, e.g. the main historical usage is
to emulate option ROMs, but TDX disallows executing from shared memory.
And if someone comes along with a legitimate, strong use case, the
restriction can always be lifted for TDX.
Don't bother trying to retroactively apply the restriction to SEV-ES
VMs that are created as type KVM_X86_DEFAULT_VM. Read-only memslots can't
possibly work for SEV-ES, i.e. disallowing such memslots is really just
means reporting an error to userspace instead of silently hanging vCPUs.
Trying to deal with the ordering between KVM_SEV_INIT and memslot creation
isn't worth the marginal benefit it would provide userspace.
Fixes: 26c44aa9e076 ("KVM: SEV: define VM types for SEV and SEV-ES")
Fixes: 1dfe571c12cf ("KVM: SEV: Add initial SEV-SNP support")
Cc: Peter Gonda <pgonda@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerly Tng <ackerleytng@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240809190319.1710470-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux
Pull selinux fixes from Paul Moore:
- Fix a xperms counting problem where we adding to the xperms count
even if we failed to add the xperm.
- Propogate errors from avc_add_xperms_decision() back to the caller so
that we can trigger the proper cleanup and error handling.
- Revert our use of vma_is_initial_heap() in favor of our older logic
as vma_is_initial_heap() doesn't correctly handle the no-heap case
and it is causing issues with the SELinux process/execheap access
control. While the older SELinux logic may not be perfect, it
restores the expected user visible behavior.
Hopefully we will be able to resolve the problem with the
vma_is_initial_heap() macro with the mm folks, but we need to fix
this in the meantime.
* tag 'selinux-pr-20240814' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
selinux: revert our use of vma_is_initial_heap()
selinux: add the processing of the failure of avc_add_xperms_decision()
selinux: fix potential counting error in avc_add_xperms_decision()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"VFS:
- Fix the name of file lease slab cache. When file leases were split
out of file locks the name of the file lock slab cache was used for
the file leases slab cache as well.
- Fix a type in take_fd() helper.
- Fix infinite directory iteration for stable offsets in tmpfs.
- When the icache is pruned all reclaimable inodes are marked with
I_FREEING and other processes that try to lookup such inodes will
block.
But some filesystems like ext4 can trigger lookups in their inode
evict callback causing deadlocks. Ext4 does such lookups if the
ea_inode feature is used whereby a separate inode may be used to
store xattrs.
Introduce I_LRU_ISOLATING which pins the inode while its pages are
reclaimed. This avoids inode deletion during inode_lru_isolate()
avoiding the deadlock and evict is made to wait until
I_LRU_ISOLATING is done.
netfs:
- Fault in smaller chunks for non-large folio mappings for
filesystems that haven't been converted to large folios yet.
- Fix the CONFIG_NETFS_DEBUG config option. The config option was
renamed a short while ago and that introduced two minor issues.
First, it depended on CONFIG_NETFS whereas it wants to depend on
CONFIG_NETFS_SUPPORT. The former doesn't exist, while the latter
does. Second, the documentation for the config option wasn't fixed
up.
- Revert the removal of the PG_private_2 writeback flag as ceph is
using it and fix how that flag is handled in netfs.
- Fix DIO reads on 9p. A program watching a file on a 9p mount
wouldn't see any changes in the size of the file being exported by
the server if the file was changed directly in the source
filesystem. Fix this by attempting to read the full size specified
when a DIO read is requested.
- Fix a NULL pointer dereference bug due to a data race where a
cachefiles cookies was retired even though it was still in use.
Check the cookie's n_accesses counter before discarding it.
nsfs:
- Fix ioctl declaration for NS_GET_MNTNS_ID from _IO() to _IOR() as
the kernel is writing to userspace.
pidfs:
- Prevent the creation of pidfds for kthreads until we have a
use-case for it and we know the semantics we want. It also confuses
userspace why they can get pidfds for kthreads.
squashfs:
- Fix an unitialized value bug reported by KMSAN caused by a
corrupted symbolic link size read from disk. Check that the
symbolic link size is not larger than expected"
* tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
Squashfs: sanity check symbolic link size
9p: Fix DIO read through netfs
vfs: Don't evict inode under the inode lru traversing context
netfs: Fix handling of USE_PGPRIV2 and WRITE_TO_CACHE flags
netfs, ceph: Revert "netfs: Remove deprecated use of PG_private_2 as a second writeback flag"
file: fix typo in take_fd() comment
pidfd: prevent creation of pidfds for kthreads
netfs: clean up after renaming FSCACHE_DEBUG config
libfs: fix infinite directory reads for offset dir
nsfs: fix ioctl declaration
fs/netfs/fscache_cookie: add missing "n_accesses" check
filelock: fix name of file_lease slab cache
netfs: Fault in smaller chunks for non-large folio mappings
|