Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping
Pull file_remove_privs() fix from Christian Brauner:
"As part of Stefan's and Jens' work to add async buffered write
support to xfs we refactored file_remove_privs() and added
__file_remove_privs() to avoid calling __remove_privs() when
IOCB_NOWAIT is passed.
While debugging a recent performance regression report I found that
during review we missed that commit faf99b563558 ("fs: add
__remove_file_privs() with flags parameter") accidently changed
behavior when dentry_needs_remove_privs() returns zero.
Before the commit it would still call inode_has_no_xattr() setting
the S_NOSEC bit and thereby avoiding even calling into
dentry_needs_remove_privs() the next time this function is called.
After that commit inode_has_no_xattr() would only be called if
__remove_privs() had to be called.
Restore the old behavior. This is likely the cause of the performance
regression"
* tag 'fs.fixes.v6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
fs: __file_remove_privs(): restore call to inode_has_no_xattr()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5 fixes 2022-08-22
This series provides bug fixes to mlx5 driver.
* tag 'mlx5-fixes-2022-08-22' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5: Unlock on error in mlx5_sriov_enable()
net/mlx5e: Fix use after free in mlx5e_fs_init()
net/mlx5e: kTLS, Use _safe() iterator in mlx5e_tls_priv_tx_list_cleanup()
net/mlx5: unlock on error path in esw_vfs_changed_event_handler()
net/mlx5e: Fix wrong tc flag used when set hw-tc-offload off
net/mlx5e: TC, Add missing policer validation
net/mlx5e: Fix wrong application of the LRO state
net/mlx5: Avoid false positive lockdep warning by adding lock_class_key
net/mlx5: Fix cmd error logging for manage pages cmd
net/mlx5: Disable irq when locking lag_lock
net/mlx5: Eswitch, Fix forwarding decision to uplink
net/mlx5: LAG, fix logic over MLX5_LAG_FLAG_NDEVS_READY
net/mlx5e: Properly disable vlan strip on non-UL reps
====================
Link: https://lore.kernel.org/r/20220822195917.216025-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
ice: xsk: reduced queue count fixes
Maciej Fijalkowski says:
this small series is supposed to fix the issues around AF_XDP usage with
reduced queue count on interface. Due to the XDP rings setup, some
configurations can result in sockets not seeing traffic flowing. More
about this in description of patch 2.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
ice: xsk: use Rx ring's XDP ring when picking NAPI context
ice: xsk: prohibit usage of non-balanced queue id
====================
Link: https://lore.kernel.org/r/20220822163257.2382487-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Wei Fang says:
====================
add interface mode select and RMII
From: Wei Fang <wei.fang@nxp.com>
The patches add the below feature support for both TJA1100 and
TJA1101 PHYs cards:
- Add MII and RMII mode support.
- Add REF_CLK input/output support for RMII mode.
====================
Link: https://lore.kernel.org/r/20220822015949.1569969-1-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add below features support for both TJA1100 and TJA1101 cards:
- Add MII and RMII mode support.
- Add REF_CLK input/output support for RMII mode.
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
TJA110x REF_CLK can be configured as interface reference clock
intput or output when the RMII mode enabled. This patch add the
property to make the REF_CLK can be configurable.
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Petr Machata says:
====================
mlxsw: Introduce modular system support by minimal driver
Vadim Pasternak writes:
This patchset adds line cards support in mlxsw_minimal, which is used
for monitoring purposes on BMC systems. The BMC is connected to the
ASIC over I2C bus, unlike the host CPU that is connected to the ASIC
via PCI bus.
The BMC system needs to be notified whenever line cards become active
or inactive, so that, for example, netdevs will be registered /
unregistered by mlxsw_minimal. However, traps cannot be generated
towards the BMC over the I2C bus. To overcome that, the I2C bus driver
(i.e., mlxsw_i2c) registers an handler for an IRQ that is fired upon
specific system wide changes, like line card activation and
deactivation.
The generated event is handled by mlxsw_core, which checks whether
anything changed in the state of available line cards. If a line card
becomes active or inactive, interested parties such as mlxsw_minimal
are notified via their registered line card event callback.
Patch set overview:
Patches #1 is preparations.
Patches #2-#3 extend mlxsw_core with an infrastructure to handle the
previously mentioned system events.
Patch #4 extends the I2C bus driver to register an handler for the IRQ
fired upon specific system wide changes.
Patches #5-#8 gradually add line cards support in mlxsw_minimal by
dynamically registering / unregistering netdevs for ports found on
line cards, whenever a line card becomes active / inactive.
====================
Link: https://lore.kernel.org/r/cover.1661093502.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Implement line card operation callbacks got_active() / got_inactive().
The purpose of these callback to create / remove line card ports after
line card is getting active / inactive.
Implement line ports_remove_selected() callback to support line card
un-provisioning flow through 'devlink'.
Add line card operation registration and de-registration APIs.
Add module offset for line card. Offset for main board iz zero.
For line card in slot #n offset is calculated as (#n - 1) multiplied by
maximum modules number.
Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The interfaces for ports found on line card are created and removed
dynamically after line card is getting active or inactive.
Introduce per line card array with module to port mapping.
For each port get 'slot_index' through PMLP register and set port
mapping for the relevant [slot_index][module] entry.
Split module and port allocation into separate routines.
Split per line card port creation and removing into separate routines.
Motivation to re-use these routines for line card operations.
Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Perform ports allocation in a separate routine.
Motivation is to re-use this routine for ports found on line cards.
Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add 'slot_index' field to port structure.
Replace zero slot_index argument with 'slot_index' in 'ethtool'
related APIs.
Add 'slot_index' argument to port initialization and
de-initialization related APIs.
Motivation is to prepare minimal driver for modular system support.
Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Extend i2c bus driver with interrupt handler to support system specific
hotplug events, related to line card state change.
Provide system IRQ line for interrupt handler. IRQ line Id could be
provided through the platform data if available, or could be set to the
default value.
Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add line card system event handler. Register it with core. It is
triggered by system interrupts raised from chassis programmable logic
devices to CPU. The purpose is to handle line card state changes over
I2C bus.
Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The purpose of system event handler is to handle system interrupts.
Such interrupts are raised to CPU from system programmable logic
devices, upon specific system wide changes, like line card activation
and deactivation.
The purpose is to create an alternative to trap mechanism, which
delivers these events to driver over PCI bus, but not available for
the driver working over I2C bus.
Mechanism is system dependent and applicable only for the systems
equipped with programmable devices with custom logic.
Add APIs for event handler registration and un-registration and API
which should be invoked from the registered callbacks when system
interrupt is raised to CPU.
Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, each line card is initialized using the following steps:
1. Initializing its various fields (e.g., slot index).
2. Creating the corresponding devlink object.
3. Enabling events (i.e., traps) for changes in line card status.
4. Querying and processing line card status.
Unlike traps, the IRQ that notifies the CPU about line card status
changes cannot be enabled / disabled on a per line card basis.
If a handler is registered before the line cards are initialized, the
handler risks accessing uninitialized memory.
On the other hand, if the handler is registered after initialization,
we risk missing events. For example, in step 4, the driver might see
that a line card is in ready state and will tell the device to enable
it. When enablement is done, the line card will be activated and the
IRQ will be triggered. Since a handler was not registered, the event
will be missed.
Solve this by splitting the initialization sequence into two steps
(1-2 and 3-4). In a subsequent patch, the handler will be registered
between both steps.
Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Provide a bit of a brain dump of netlink related information
as documentation. Hopefully this will be useful to people
trying to navigate implementing YAML based parsing in languages
we won't be able to help with.
I started writing this doc while trying to figure out what
it'd take to widen the applicability of YAML to good old rtnl,
but the doc grew beyond that as it usually happens.
In all honesty a lot of this information is new to me as I usually
follow the "copy an existing example, drink to forget" process
of writing netlink user space, so reviews will be much appreciated.
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20220819200221.422801-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Subsequent patch will render the kdoc from
include/uapi/linux/netlink.h into Documentation.
We need to fix the warnings. While at it move
the comments on struct nlmsghdr to a proper
kdoc comment.
Link: https://lore.kernel.org/r/20220819200221.422801-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Michael Chan says:
====================
bnxt_en: Bug fixes
This series includes 2 fixes for regressions introduced by the XDP
multi-buffer feature, 1 devlink reload bug fix, and 1 SRIOV resource
accounting bug fix.
====================
Link: https://lore.kernel.org/r/1661180814-19350-1-git-send-email-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
LRO/GRO_HW should be disabled if there is an attached XDP program.
BNXT_FLAG_TPA is the current setting of the LRO/GRO_HW. Using
BNXT_FLAG_TPA to disable LRO/GRO_HW will cause these features to be
permanently disabled once they are disabled.
Fixes: 1dc4c557bfed ("bnxt: adding bnxt_xdp_build_skb to build skb from multibuffer xdp_buff")
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
There are 2 issues:
1. We should decrement hw_resc->max_nqs instead of hw_resc->max_irqs
with the number of NQs assigned to the VFs. The IRQs are fixed
on each function and cannot be re-assigned. Only the NQs are being
assigned to the VFs.
2. vf_msix is the total number of NQs to be assigned to the VFs. So
we should decrement vf_msix from hw_resc->max_nqs.
Fixes: b16b68918674 ("bnxt_en: Add SR-IOV support for 57500 chips.")
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add missing devlink_set_features() API for callbacks reload_down
and reload_up to function.
Fixes: 228ea8c187d8 ("bnxt_en: implement devlink dev reload driver_reinit")
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Using BNXT_PAGE_MODE_BUF_SIZE + offset as buffer length value is not
sufficient when running single buffer XDP programs doing redirect
operations. The stack will complain on missing skb tail room. Fix it
by using PAGE_SIZE when calling xdp_init_buff() for single buffer
programs.
Fixes: b231c3f3414c ("bnxt: refactor bnxt_rx_xdp to separate xdp_init_buff/xdp_prepare_buff")
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In case ftmac100 is used with a DSA switch, Linux wants to set MTU
to 1504 to accommodate for DSA overhead. With the default max_mtu
it leads to the error message:
ftmac100 92000000.mac eth0: error -22 setting MTU to 1504 to include DSA overhead
ftmac100 supports packet length 1518 (MAX_PKT_SIZE constant), so it is
safe to report it in max_mtu.
Signed-off-by: Sergei Antonov <saproj@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20220821160844.474277-1-saproj@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Address learning should initially be turned off by the driver for port
operation in standalone mode, then the DSA core handles changes to it
via ds->ops->port_bridge_flags().
Leaving address learning enabled while ports are standalone breaks any
kind of communication which involves port B receiving what port A has
sent. Notably it breaks the ksz9477 driver used with a (non offloaded,
ports act as if standalone) bonding interface in active-backup mode,
when the ports are connected together through external switches, for
redundancy purposes.
This fixes a major design flaw in the ksz9477 and ksz8795 drivers, which
unconditionally leave address learning enabled even while ports operate
as standalone.
Fixes: b987e98e50ab ("dsa: add DSA switch driver for Microchip KSZ9477")
Link: https://lore.kernel.org/netdev/CAFZh4h-JVWt80CrQWkFji7tZJahMfOToUJQgKS5s0_=9zzpvYQ@mail.gmail.com/
Reported-by: Brian Hutchinson <b.hutchman@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20220818164809.3198039-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc fixes from Helge Deller:
"Some interesting background to the current patchset:
It turned out that the fldw instruction (which loads a 32-bit word
from memory into one half of a FP register) failed on unaligned
addresses and even trashed some other random FP register instead. It's
a trivial one-liner fix in the exception handler but this failure
dates back to the very beginnings of the parisc-port. It's strange
that it was never noticed before.
Another patch fixes an annoyance noticed by Randy Dunlap. Running
"make ARCH=parisc64 randconfig" always returned a 32-bit config,
although one would expect a 64-bit config. Masahiro Yamada suggested
to mimik sparc Kconfig code, which fixed the issue nicely. This
allowed to drop some compiler build checks too.
Third, it's possible to build an optimized 32-bit kernel for PA8X00
(64-bit) CPUs, which then wouldn't start on 32-bit-only (PA1.x)
machines. I've added a bootup check which prevents that and which
prints a message to the console. This can be tested with qemu, which
currently only supports 32-bit emulation.
The other patches are usual clean-up stuff like added return value
checks and typo fixes in comments.
Summary:
- Fix emulation of fldw instruction on unaligned addresses
- Fix "make ARCH=parisc64 randconfig" to return a 64-bit config
- Prevent boot if trying to boot a 32-bit kernel compiled for PA8X00
CPUs on 32-bit only machines
- ccio-dma: Handle kmalloc failure in ccio_init_resources()"
* tag 'parisc-for-6.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Add runtime check to prevent PA2.0 kernels on PA1.x machines
parisc: ccio-dma: Handle kmalloc failure in ccio_init_resources()
parisc: led: Move from strlcpy with unused retval to strscpy
parisc: ccio-dma: Fix typo in comment
Revert "parisc: Show error if wrong 32/64-bit compiler is being used"
parisc: Make CONFIG_64BIT available for ARCH=parisc64 only
parisc: Fix exception handler for fldw and fstw instructions
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"Thirteen fixes, almost all for MM.
Seven of these are cc:stable and the remainder fix up the changes
which went into this -rc cycle"
* tag 'mm-hotfixes-stable-2022-08-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
kprobes: don't call disarm_kprobe() for disabled kprobes
mm/shmem: shmem_replace_page() remember NR_SHMEM
mm/shmem: tmpfs fallocate use file_modified()
mm/shmem: fix chattr fsflags support in tmpfs
mm/hugetlb: support write-faults in shared mappings
mm/hugetlb: fix hugetlb not supporting softdirty tracking
mm/uffd: reset write protection when unregister with wp-mode
mm/smaps: don't access young/dirty bit if pte unpresent
mm: add DEVICE_ZONE to FOR_ALL_ZONES
kernel/sys_ni: add compat entry for fadvise64_64
mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
Revert "zram: remove double compression logic"
get_maintainer: add Alan to .get_maintainer.ignore
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull KUnit fixes from Shuah Khan:
"Fix for a mmc test and to load .kunit_test_suites section when
CONFIG_KUNIT=m, and not just when KUnit is built-in"
* tag 'linux-kselftest-kunit-fixes-6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
module: kunit: Load .kunit_test_suites section when CONFIG_KUNIT=m
mmc: sdhci-of-aspeed: test: Fix dependencies when KUNIT=m
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull Kselftest fixes from Shuah Khan:
"Fixes to vm and sgx test builds"
* tag 'linux-kselftest-fixes-6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests/vm: fix inability to build any vm tests
selftests/sgx: Ignore OpenSSL 3.0 deprecated functions warning
|
|
TPROXY is only allowed from prerouting, but nft_tproxy doesn't check this.
This fixes a crash (null dereference) when using tproxy from e.g. output.
Fixes: 4ed8eb6570a4 ("netfilter: nf_tables: Add native tproxy support")
Reported-by: Shell Chen <xierch@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
|
|
Root cause:
The rebind_subsystems() is no lock held when move css object from A
list to B list,then let B's head be treated as css node at
list_for_each_entry_rcu().
Solution:
Add grace period before invalidating the removed rstat_css_node.
Reported-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
Suggested-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
Tested-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
Link: https://lore.kernel.org/linux-arm-kernel/d8f0bc5e2fb6ed259f9334c83279b4c011283c41.camel@mediatek.com/T/
Acked-by: Mukesh Ojha <quic_mojha@quicinc.com>
Fixes: a7df69b81aac ("cgroup: rstat: support cgroup1")
Cc: stable@vger.kernel.org # v5.13+
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
When a TCP sends more bytes than allowed by the receive window, all future
packets can be marked as invalid.
This can clog up the conntrack table because of 5-day default timeout.
Sequence of packets:
01 initiator > responder: [S], seq 171, win 5840, options [mss 1330,sackOK,TS val 63 ecr 0,nop,wscale 1]
02 responder > initiator: [S.], seq 33211, ack 172, win 65535, options [mss 1460,sackOK,TS val 010 ecr 63,nop,wscale 8]
03 initiator > responder: [.], ack 33212, win 2920, options [nop,nop,TS val 068 ecr 010], length 0
04 initiator > responder: [P.], seq 172:240, ack 33212, win 2920, options [nop,nop,TS val 279 ecr 010], length 68
Window is 5840 starting from 33212 -> 39052.
05 responder > initiator: [.], ack 240, win 256, options [nop,nop,TS val 872 ecr 279], length 0
06 responder > initiator: [.], seq 33212:34530, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 1318
This is fine, conntrack will flag the connection as having outstanding
data (UNACKED), which lowers the conntrack timeout to 300s.
07 responder > initiator: [.], seq 34530:35848, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 1318
08 responder > initiator: [.], seq 35848:37166, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 1318
09 responder > initiator: [.], seq 37166:38484, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 1318
10 responder > initiator: [.], seq 38484:39802, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 1318
Packet 10 is already sending more than permitted, but conntrack doesn't
validate this (only seq is tested vs. maxend, not 'seq+len').
38484 is acceptable, but only up to 39052, so this packet should
not have been sent (or only 568 bytes, not 1318).
At this point, connection is still in '300s' mode.
Next packet however will get flagged:
11 responder > initiator: [P.], seq 39802:40128, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 326
nf_ct_proto_6: SEQ is over the upper bound (over the window of the receiver) .. LEN=378 .. SEQ=39802 ACK=240 ACK PSH ..
Now, a couple of replies/acks comes in:
12 initiator > responder: [.], ack 34530, win 4368,
[.. irrelevant acks removed ]
16 initiator > responder: [.], ack 39802, win 8712, options [nop,nop,TS val 296201291 ecr 2982371892], length 0
This ack is significant -- this acks the last packet send by the
responder that conntrack considered valid.
This means that ack == td_end. This will withdraw the
'unacked data' flag, the connection moves back to the 5-day timeout
of established conntracks.
17 initiator > responder: ack 40128, win 10030, ...
This packet is also flagged as invalid.
Because conntrack only updates state based on packets that are
considered valid, packet 11 'did not exist' and that gets us:
nf_ct_proto_6: ACK is over upper bound 39803 (ACKed data not seen yet) .. SEQ=240 ACK=40128 WINDOW=10030 RES=0x00 ACK URG
Because this received and processed by the endpoints, the conntrack entry
remains in a bad state, no packets will ever be considered valid again:
30 responder > initiator: [F.], seq 40432, ack 2045, win 391, ..
31 initiator > responder: [.], ack 40433, win 11348, ..
32 initiator > responder: [F.], seq 2045, ack 40433, win 11348 ..
... all trigger 'ACK is over bound' test and we end up with
non-early-evictable 5-day default timeout.
NB: This patch triggers a bunch of checkpatch warnings because of silly
indent. I will resend the cleanup series linked below to reduce the
indent level once this change has propagated to net-next.
I could route the cleanup via nf but that causes extra backport work for
stable maintainers.
Link: https://lore.kernel.org/netfilter-devel/20220720175228.17880-1-fw@strlen.de/T/#mb1d7147d36294573cc4f81d00f9f8dadfdd06cd8
Signed-off-by: Florian Westphal <fw@strlen.de>
|
|
Harshit Mogalapalli says:
In ebt_do_table() function dereferencing 'private->hook_entry[hook]'
can lead to NULL pointer dereference. [..] Kernel panic:
general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f]
[..]
RIP: 0010:ebt_do_table+0x1dc/0x1ce0
Code: 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 5c 16 00 00 48 b8 00 00 00 00 00 fc ff df 49 8b 6c df 08 48 8d 7d 2c 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 88
[..]
Call Trace:
nf_hook_slow+0xb1/0x170
__br_forward+0x289/0x730
maybe_deliver+0x24b/0x380
br_flood+0xc6/0x390
br_dev_xmit+0xa2e/0x12c0
For some reason ebtables rejects blobs that provide entry points that are
not supported by the table, but what it should instead reject is the
opposite: blobs that DO NOT provide an entry point supported by the table.
t->valid_hooks is the bitmask of hooks (input, forward ...) that will see
packets. Providing an entry point that is not support is harmless
(never called/used), but the inverse isn't: it results in a crash
because the ebtables traverser doesn't expect a NULL blob for a location
its receiving packets for.
Instead of fixing all the individual checks, do what iptables is doing and
reject all blobs that differ from the expected hooks.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Reported-by: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
|
|
When a driver returns -EOPNOTSUPP in dsa_port_bridge_join() but failed
to provide a reason for it, DSA attempts to set the extack to say that
software fallback will kick in.
The problem is, when we use brctl and the legacy bridge ioctls, the
extack will be NULL, and DSA dereferences it in the process of setting
it.
Sergei Antonov proves this using the following stack trace:
Unable to handle kernel NULL pointer dereference at virtual address 00000000
PC is at dsa_slave_changeupper+0x5c/0x158
dsa_slave_changeupper from raw_notifier_call_chain+0x38/0x6c
raw_notifier_call_chain from __netdev_upper_dev_link+0x198/0x3b4
__netdev_upper_dev_link from netdev_master_upper_dev_link+0x50/0x78
netdev_master_upper_dev_link from br_add_if+0x430/0x7f4
br_add_if from br_ioctl_stub+0x170/0x530
br_ioctl_stub from br_ioctl_call+0x54/0x7c
br_ioctl_call from dev_ifsioc+0x4e0/0x6bc
dev_ifsioc from dev_ioctl+0x2f8/0x758
dev_ioctl from sock_ioctl+0x5f0/0x674
sock_ioctl from sys_ioctl+0x518/0xe40
sys_ioctl from ret_fast_syscall+0x0/0x1c
Fix the problem by only overriding the extack if non-NULL.
Fixes: 1c6e8088d9a7 ("net: dsa: allow port_bridge_join() to override extack message")
Link: https://lore.kernel.org/netdev/CABikg9wx7vB5eRDAYtvAm7fprJ09Ta27a4ZazC=NX5K4wn6pWA@mail.gmail.com/
Reported-by: Sergei Antonov <saproj@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Sergei Antonov <saproj@gmail.com>
Link: https://lore.kernel.org/r/20220819173925.3581871-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Looks to have been left out in an oversight.
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: Sainath Grandhi <sainath.grandhi@intel.com>
Fixes: 235a9d89da97 ('ipvtap: IP-VLAN based tap driver')
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Link: https://lore.kernel.org/r/20220821130808.12143-1-zenczykowski@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Vladimir Oltean says:
====================
DSA changes for multiple CPU ports (part 3)
Those who have been following part 1:
https://patchwork.kernel.org/project/netdevbpf/cover/20220511095020.562461-1-vladimir.oltean@nxp.com/
and part 2:
https://patchwork.kernel.org/project/netdevbpf/cover/20220521213743.2735445-1-vladimir.oltean@nxp.com/
will know that I am trying to enable the second internal port pair from
the NXP LS1028A Felix switch for DSA-tagged traffic via "ocelot-8021q".
This series represents part 3 of that effort.
Covered here are some preparations in DSA for handling multiple DSA
masters:
- when changing the tagging protocol via sysfs
- when the masters go down
as well as preparation for monitoring the upper devices of a DSA master
(to support DSA masters under a LAG).
There are also 2 small preparations for the ocelot driver, for the case
where multiple tag_8021q CPU ports are used in a LAG. Both those changes
have to do with PGID forwarding domains.
Compared to v1, the patches were trimmed down to just another
preparation stage, and the UAPI changes were pushed further out to part 4.
https://patchwork.kernel.org/project/netdevbpf/cover/20220523104256.3556016-1-olteanv@gmail.com/
Compared to v2, I had to export a symbol I forgot to
(ocelot_port_teardown_dsa_8021q_cpu), to avoid a build breakage when the
felix and seville drivers are built as modules.
====================
Link: https://lore.kernel.org/r/20220819174820.3585002-1-vladimir.oltean@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Currently when we have 2 CPU ports configured for DSA tag_8021q mode and
we put them in a LAG, a PGID dump looks like this:
PGID_SRC[0] = ports 4,
PGID_SRC[1] = ports 4,
PGID_SRC[2] = ports 4,
PGID_SRC[3] = ports 4,
PGID_SRC[4] = ports 0, 1, 2, 3, 4, 5,
PGID_SRC[5] = no ports
(ports 0-3 are user ports, ports 4 and 5 are CPU ports)
There are 2 problems with the configuration above:
- user ports should enable forwarding towards both CPU ports, not just 4,
and the aggregation PGIDs should prune one CPU port or the other from
the destination port mask, based on a hash computed from packet headers.
- CPU ports should not be allowed to forward towards themselves and also
not towards other ports in the same LAG as themselves
The first problem requires fixing up the PGID_SRC of user ports, when
ocelot_port_assigned_dsa_8021q_cpu_mask() is called. We need to say that
when a user port is assigned to a tag_8021q CPU port and that port is in
a LAG, it should forward towards all ports in that LAG.
The second problem requires fixing up the PGID_SRC of port 4, to remove
ports 4 and 5 (in a LAG) from the allowed destinations.
After this change, the PGID source masks look as follows:
PGID_SRC[0] = ports 4, 5,
PGID_SRC[1] = ports 4, 5,
PGID_SRC[2] = ports 4, 5,
PGID_SRC[3] = ports 4, 5,
PGID_SRC[4] = ports 0, 1, 2, 3,
PGID_SRC[5] = no ports
Note that PGID_SRC[5] still looks weird (it should say "0, 1, 2, 3" just
like PGID_SRC[4] does), but I've tested forwarding through this CPU port
and it doesn't seem like anything is affected (it appears that PGID_SRC[4]
is being looked up on forwarding from the CPU, since both ports 4 and 5
have logical port ID 4). The reason why it looks weird is because
we've never called ocelot_port_assign_dsa_8021q_cpu() for any user port
towards port 5 (all user ports are assigned to port 4 which is in a LAG
with 5).
Since things aren't broken, I'm willing to leave it like that for now
and just document the oddity.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This is a partial revert of commit c295f9831f1d ("net: mscc: ocelot:
switch from {,un}set to {,un}assign for tag_8021q CPU ports"), because
as it turns out, this isn't how tag_8021q CPU ports under a LAG are
supposed to work.
Under that scenario, all user ports are "assigned" to the single
tag_8021q CPU port represented by the logical port corresponding to the
bonding interface. So one CPU port in a LAG would have is_dsa_8021q_cpu
set to true (the one whose physical port ID is equal to the logical port
ID), and the other one to false.
In turn, this makes 2 undesirable things happen:
(1) PGID_CPU contains only the first physical CPU port, rather than both
(2) only the first CPU port will be added to the private VLANs used by
ocelot for VLAN-unaware bridging
To make the driver behave in the same way for both bonded CPU ports, we
need to bring back the old concept of setting up a port as a tag_8021q
CPU port, and this is what deals with VLAN membership and PGID_CPU
updating. But we also need the CPU port "assignment" (the user to CPU
port affinity), and this is what updates the PGID_SRC forwarding rules.
All DSA CPU ports are statically configured for tag_8021q mode when the
tagging protocol is changed to ocelot-8021q. User ports are "assigned"
to one CPU port or the other dynamically (this will be handled by a
future change).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
More logic will be added to dsa_tree_setup_master() and
dsa_tree_teardown_master() in upcoming changes.
Reduce the indentation by one level in these functions by introducing
and using a dedicated iterator for CPU ports of a tree.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The fact that the tagging protocol is set and queried from the
/sys/class/net/<dsa-master>/dsa/tagging file is a bit of a quirk from
the single CPU port days which isn't aging very well now that DSA can
have more than a single CPU port. This is because the tagging protocol
is a switch property, yet in the presence of multiple CPU ports it can
be queried and set from multiple sysfs files, all of which are handled
by the same implementation.
The current logic ensures that the net device whose sysfs file we're
changing the tagging protocol through must be down. That net device is
the DSA master, and this is fine for single DSA master / CPU port setups.
But exactly because the tagging protocol is per switch [ tree, in fact ]
and not per DSA master, this isn't fine any longer with multiple CPU
ports, and we must iterate through the tree and find all DSA masters,
and make sure that all of them are down.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This is an adaptation of commit c0a8a9c27493 ("net: dsa: automatically
bring user ports down when master goes down") for multiple DSA masters.
When a DSA master goes down, only the user ports under its control
should go down too, the others can still send/receive traffic.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
All the traffic to/from a DSA master is supposed to be distributed among
its DSA switch upper interfaces, so we should not allow other upper
device kinds.
An exception to this is DSA_TAG_PROTO_NONE (switches with no DSA tags),
and in that case it is actually expected to create e.g. VLAN interfaces
on the master. But for those, netdev_uses_dsa(master) returns false, so
the restriction doesn't apply.
The motivation for this change is to allow LAG interfaces of DSA masters
to be DSA masters themselves. We want to restrict the user's degrees of
freedom by 1: the LAG should already have all DSA masters as lowers, and
while lower ports of the LAG can be removed, none can be added after the
fact.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
When DSA gains support for multiple CPU ports in a LAG, it will become
mandatory to monitor the changeupper events for the DSA master.
In fact, there are already some restrictions to be imposed in that area,
namely that a DSA master cannot be a bridge port except in some special
circumstances.
Centralize the restrictions at the level of the DSA layer as a
preliminary step.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
dsa_slave_prechangeupper_sanity_check() is supposed to enforce some
adjacency restrictions, and calls ds->ops->port_prechangeupper if the
driver implements it.
We convert the error code from the port_prechangeupper() call to a
notifier code, and 0 is converted to NOTIFY_OK, but the caller of
dsa_slave_prechangeupper_sanity_check() stops at any notifier code
different from NOTIFY_DONE.
Avoid this by converting back the notifier code to an error code, so
that both NOTIFY_OK and NOTIFY_DONE will be seen as 0. This allows more
parallel sanity check functions to be added.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Traditionally, DSA has had a single netdev notifier handling function
for each device type.
For the sake of code cleanliness, we would like to introduce more
handling functions which do one thing, but the conditions for entering
these functions start to overlap. Example: a handling function which
tracks whether any bridges contain both DSA and non-DSA interfaces.
Either this is placed before dsa_slave_changeupper(), case in which it
will prevent that function from executing, or we place it after
dsa_slave_changeupper(), case in which we will prevent it from
executing. The other alternative is to ignore errors from the new
handling function (not ideal).
To support this usage, we need to change the pattern. In the new model,
we enter all notifier handling sub-functions, and exit with NOTIFY_DONE
if there is nothing to do. This allows the sub-functions to be
relatively free-form and independent from each other.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Arseniy Krasnov says:
====================
vsock: updates for SO_RCVLOWAT handling
This patchset includes some updates for SO_RCVLOWAT:
1) af_vsock:
During my experiments with zerocopy receive, i found, that in some
cases, poll() implementation violates POSIX: when socket has non-
default SO_RCVLOWAT(e.g. not 1), poll() will always set POLLIN and
POLLRDNORM bits in 'revents' even number of bytes available to read
on socket is smaller than SO_RCVLOWAT value. In this case,user sees
POLLIN flag and then tries to read data(for example using 'read()'
call), but read call will be blocked, because SO_RCVLOWAT logic is
supported in dequeue loop in af_vsock.c. But the same time, POSIX
requires that:
"POLLIN Data other than high-priority data may be read without
blocking.
POLLRDNORM Normal data may be read without blocking."
See https://www.open-std.org/jtc1/sc22/open/n4217.pdf, page 293.
So, we have, that poll() syscall returns POLLIN, but read call will
be blocked.
Also in man page socket(7) i found that:
"Since Linux 2.6.28, select(2), poll(2), and epoll(7) indicate a
socket as readable only if at least SO_RCVLOWAT bytes are available."
I checked TCP callback for poll()(net/ipv4/tcp.c, tcp_poll()), it
uses SO_RCVLOWAT value to set POLLIN bit, also i've tested TCP with
this case for TCP socket, it works as POSIX required.
I've added some fixes to af_vsock.c and virtio_transport_common.c,
test is also implemented.
2) virtio/vsock:
It adds some optimization to wake ups, when new data arrived. Now,
SO_RCVLOWAT is considered before wake up sleepers who wait new data.
There is no sense, to kick waiter, when number of available bytes
in socket's queue < SO_RCVLOWAT, because if we wake up reader in
this case, it will wait for SO_RCVLOWAT data anyway during dequeue,
or in poll() case, POLLIN/POLLRDNORM bits won't be set, so such
exit from poll() will be "spurious". This logic is also used in TCP
sockets.
3) vmci/vsock:
Same as 2), but i'm not sure about this changes. Will be very good,
to get comments from someone who knows this code.
4) Hyper-V:
As Dexuan Cui mentioned, for Hyper-V transport it is difficult to
support SO_RCVLOWAT, so he suggested to disable this feature for
Hyper-V.
====================
Link: https://lore.kernel.org/r/de41de4c-0345-34d7-7c36-4345258b7ba8@sberdevices.ru
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This adds test to check, that when poll() returns POLLIN, POLLRDNORM bits,
next read call won't block.
Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This adds extra condition to wake up data reader: do it only when number
of readable bytes >= SO_RCVLOWAT. Otherwise, there is no sense to kick
user, because it will wait until SO_RCVLOWAT bytes will be dequeued. This
check is performed in vsock_data_ready().
Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Reviewed-by: Vishnu Dasa <vdasa@vmware.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This adds extra condition to wake up data reader: do it only when number
of readable bytes >= SO_RCVLOWAT. Otherwise, there is no sense to kick
user,because it will wait until SO_RCVLOWAT bytes will be dequeued. This
check is performed in vsock_data_ready().
Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This adds 'vsock_data_ready()' which must be called by transport to kick
sleeping data readers. It checks for SO_RCVLOWAT value before waking
user, thus preventing spurious wake ups. Based on 'tcp_data_ready()' logic.
Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Passing 1 as the target to notify_poll_in(), we don't honor
what the user has set via SO_RCVLOWAT, going to set POLLIN
and POLLRDNORM, even if we don't have the amount of bytes
expected by the user.
Let's use sock_rcvlowat() to get the right target to pass to
notify_poll_in();
Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|