Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- Fix regression with scanning not working in some systems.
* tag 'for-net-2022-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: Fix not checking MGMT cmd pending queue
====================
Link: https://lore.kernel.org/r/20220302004330.125536-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Íñigo Huguet says:
====================
sfc: optimize RXQs count and affinities
In sfc driver one RX queue per physical core was allocated by default.
Later on, IRQ affinities were set spreading the IRQs in all NUMA local
CPUs.
However, with that default configuration it result in a non very optimal
configuration in many modern systems. Specifically, in systems with hyper
threading and 2 NUMA nodes, affinities are set in a way that IRQs are
handled by all logical cores of one same NUMA node. Handling IRQs from
both hyper threading siblings has no benefit, and setting affinities to one
queue per physical core is neither a very good idea because there is a
performance penalty for moving data across nodes (I was able to check it
with some XDP tests using pktgen).
This patches reduce the default number of channels to one per physical
core in the local NUMA node. Then, they set IRQ affinities to CPUs in
the local NUMA node only. This way we save hardware resources since
channels are limited resources. We also leave more room for XDP_TX
channels without hitting driver's limit of 32 channels per interface.
Running performance tests using iperf with a SFC9140 device showed no
performance penalty for reducing the number of channels.
RX XDP tests showed that performance can go down to less than half if
the IRQ is handled by a CPU in a different NUMA node, which doesn't
happen with the new defaults from this patches.
====================
Link: https://lore.kernel.org/r/20220228132254.25787-1-ihuguet@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Affinity hints were being set to CPUs in local NUMA node first, and then
in other CPUs. This was creating 2 unintended issues:
1. Channels created to be assigned each to a different physical core
were assigned to hyperthreading siblings because of being in same
NUMA node.
Since the patch previous to this one, this did not longer happen
with default rss_cpus modparam because less channels are created.
2. XDP channels could be assigned to CPUs in different NUMA nodes,
decreasing performance too much (to less than half in some of my
tests).
This patch sets the affinity hints spreading the channels only in local
NUMA node's CPUs. A fallback for the case that no CPU in local NUMA node
is online has been added too.
Example of CPUs being assigned in a non optimal way before this and the
previous patch (note: in this system, xdp-8 to xdp-15 are created
because num_possible_cpus == 64, but num_present_cpus == 32 so they're
never used):
$ lscpu | grep -i numa
NUMA node(s): 2
NUMA node0 CPU(s): 0-7,16-23
NUMA node1 CPU(s): 8-15,24-31
$ grep -H . /proc/irq/*/0000:07:00.0*/../smp_affinity_list
/proc/irq/141/0000:07:00.0-0/../smp_affinity_list:0
/proc/irq/142/0000:07:00.0-1/../smp_affinity_list:1
/proc/irq/143/0000:07:00.0-2/../smp_affinity_list:2
/proc/irq/144/0000:07:00.0-3/../smp_affinity_list:3
/proc/irq/145/0000:07:00.0-4/../smp_affinity_list:4
/proc/irq/146/0000:07:00.0-5/../smp_affinity_list:5
/proc/irq/147/0000:07:00.0-6/../smp_affinity_list:6
/proc/irq/148/0000:07:00.0-7/../smp_affinity_list:7
/proc/irq/149/0000:07:00.0-8/../smp_affinity_list:16
/proc/irq/150/0000:07:00.0-9/../smp_affinity_list:17
/proc/irq/151/0000:07:00.0-10/../smp_affinity_list:18
/proc/irq/152/0000:07:00.0-11/../smp_affinity_list:19
/proc/irq/153/0000:07:00.0-12/../smp_affinity_list:20
/proc/irq/154/0000:07:00.0-13/../smp_affinity_list:21
/proc/irq/155/0000:07:00.0-14/../smp_affinity_list:22
/proc/irq/156/0000:07:00.0-15/../smp_affinity_list:23
/proc/irq/157/0000:07:00.0-xdp-0/../smp_affinity_list:8
/proc/irq/158/0000:07:00.0-xdp-1/../smp_affinity_list:9
/proc/irq/159/0000:07:00.0-xdp-2/../smp_affinity_list:10
/proc/irq/160/0000:07:00.0-xdp-3/../smp_affinity_list:11
/proc/irq/161/0000:07:00.0-xdp-4/../smp_affinity_list:12
/proc/irq/162/0000:07:00.0-xdp-5/../smp_affinity_list:13
/proc/irq/163/0000:07:00.0-xdp-6/../smp_affinity_list:14
/proc/irq/164/0000:07:00.0-xdp-7/../smp_affinity_list:15
/proc/irq/165/0000:07:00.0-xdp-8/../smp_affinity_list:24
/proc/irq/166/0000:07:00.0-xdp-9/../smp_affinity_list:25
/proc/irq/167/0000:07:00.0-xdp-10/../smp_affinity_list:26
/proc/irq/168/0000:07:00.0-xdp-11/../smp_affinity_list:27
/proc/irq/169/0000:07:00.0-xdp-12/../smp_affinity_list:28
/proc/irq/170/0000:07:00.0-xdp-13/../smp_affinity_list:29
/proc/irq/171/0000:07:00.0-xdp-14/../smp_affinity_list:30
/proc/irq/172/0000:07:00.0-xdp-15/../smp_affinity_list:31
CPUs assignments after this and previous patch, so normal channels
created only one per core in NUMA node and affinities set only to local
NUMA node:
$ grep -H . /proc/irq/*/0000:07:00.0*/../smp_affinity_list
/proc/irq/116/0000:07:00.0-0/../smp_affinity_list:0
/proc/irq/117/0000:07:00.0-1/../smp_affinity_list:1
/proc/irq/118/0000:07:00.0-2/../smp_affinity_list:2
/proc/irq/119/0000:07:00.0-3/../smp_affinity_list:3
/proc/irq/120/0000:07:00.0-4/../smp_affinity_list:4
/proc/irq/121/0000:07:00.0-5/../smp_affinity_list:5
/proc/irq/122/0000:07:00.0-6/../smp_affinity_list:6
/proc/irq/123/0000:07:00.0-7/../smp_affinity_list:7
/proc/irq/124/0000:07:00.0-xdp-0/../smp_affinity_list:16
/proc/irq/125/0000:07:00.0-xdp-1/../smp_affinity_list:17
/proc/irq/126/0000:07:00.0-xdp-2/../smp_affinity_list:18
/proc/irq/127/0000:07:00.0-xdp-3/../smp_affinity_list:19
/proc/irq/128/0000:07:00.0-xdp-4/../smp_affinity_list:20
/proc/irq/129/0000:07:00.0-xdp-5/../smp_affinity_list:21
/proc/irq/130/0000:07:00.0-xdp-6/../smp_affinity_list:22
/proc/irq/131/0000:07:00.0-xdp-7/../smp_affinity_list:23
/proc/irq/132/0000:07:00.0-xdp-8/../smp_affinity_list:0
/proc/irq/133/0000:07:00.0-xdp-9/../smp_affinity_list:1
/proc/irq/134/0000:07:00.0-xdp-10/../smp_affinity_list:2
/proc/irq/135/0000:07:00.0-xdp-11/../smp_affinity_list:3
/proc/irq/136/0000:07:00.0-xdp-12/../smp_affinity_list:4
/proc/irq/137/0000:07:00.0-xdp-13/../smp_affinity_list:5
/proc/irq/138/0000:07:00.0-xdp-14/../smp_affinity_list:6
/proc/irq/139/0000:07:00.0-xdp-15/../smp_affinity_list:7
Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Handling channels from CPUs in different NUMA node can penalize
performance, so better configure only one channel per core in the same
NUMA node than the NIC, and not per each core in the system.
Fallback to all other online cores if there are not online CPUs in local
NUMA node.
Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix build:
include/linux/minmax.h:45:25: note: in expansion of macro ‘__careful_cmp’
45 | #define min(x, y) __careful_cmp(x, y, <)
| ^~~~~~~~~~~~~
net/smc/smc_tx.c:150:24: note: in expansion of macro ‘min’
150 | corking_size = min(sock_net(&smc->sk)->smc.sysctl_autocorking_size,
| ^~~
Fixes: 12bbb0d163a9 ("net/smc: add sysctl for autocorking")
Link: https://lore.kernel.org/r/20220301222446.1271127-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
A number of places in the MGMT handlers we examine the command queue for
other commands (in progress but not yet complete) that will interact
with the process being performed. However, not all commands go into the
queue if one of:
1. There is no negative side effect of consecutive or redundent commands
2. The command is entirely perform "inline".
This change examines each "pending command" check, and if it is not
needed, deletes the check. Of the remaining pending command checks, we
make sure that the command is in the pending queue by using the
mgmt_pending_add/mgmt_pending_remove pair rather than the
mgmt_pending_new/mgmt_pending_free pair.
Link: https://lore.kernel.org/linux-bluetooth/f648f2e11bb3c2974c32e605a85ac3a9fac944f1.camel@redhat.com/T/
Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Brian Gix <brian.gix@intel.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
Syzkaller reports another issue:
WARNING: CPU: 0 PID: 10775 at include/linux/thread_info.h:230
check_copy_size include/linux/thread_info.h:230 [inline]
WARNING: CPU: 0 PID: 10775 at include/linux/thread_info.h:230
copy_to_user include/linux/uaccess.h:199 [inline]
WARNING: CPU: 0 PID: 10775 at include/linux/thread_info.h:230
bpf_test_finish.isra.0+0x4b2/0x680 net/bpf/test_run.c:171
This can happen when the userspace buffer is smaller than head + frags.
Return ENOSPC in this case.
Fixes: 7855e0db150a ("bpf: test_run: add xdp_shared_info pointer in bpf_test_finish signature")
Reported-by: syzbot+5f81df6205ecbbc56ab5@syzkaller.appspotmail.com
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/bpf/20220228232332.458871-1-sdf@google.com
|
|
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Use kfree_rcu(ptr, rcu) variant, using kfree_rcu(ptr) was not
intentional. From Eric Dumazet.
2) Use-after-free in netfilter hook core, from Eric Dumazet.
3) Missing rcu read lock side for netfilter egress hook,
from Florian Westphal.
4) nf_queue assume state->sk is full socket while it might not be.
Invoke sock_gen_put(), from Florian Westphal.
5) Add selftest to exercise the reported KASAN splat in 4)
6) Fix possible use-after-free in nf_queue in case sk_refcnt is 0.
Also from Florian.
7) Use input interface index only for hardware offload, not for
the software plane. This breaks tc ct action. Patch from Paul Blakey.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
net/sched: act_ct: Fix flow table lookup failure with no originating ifindex
netfilter: nf_queue: handle socket prefetch
netfilter: nf_queue: fix possible use-after-free
selftests: netfilter: add nfqueue TCP_NEW_SYN_RECV socket race test
netfilter: nf_queue: don't assume sk is full socket
netfilter: egress: silence egress hook lockdep splats
netfilter: fix use-after-free in __nf_register_net_hook()
netfilter: nf_tables: prefer kfree_rcu(ptr, rcu) variant
====================
Link: https://lore.kernel.org/r/20220301215337.378405-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
After cited commit optimizted hw insertion, flow table entries are
populated with ifindex information which was intended to only be used
for HW offload. This tuple ifindex is hashed in the flow table key, so
it must be filled for lookup to be successful. But tuple ifindex is only
relevant for the netfilter flowtables (nft), so it's not filled in
act_ct flow table lookup, resulting in lookup failure, and no SW
offload and no offload teardown for TCP connection FIN/RST packets.
To fix this, add new tc ifindex field to tuple, which will
only be used for offloading, not for lookup, as it will not be
part of the tuple hash.
Fixes: 9795ded7f924 ("net/sched: act_ct: Fill offloading tuple iifidx")
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Pull kvm fixes from Paolo Bonzini:
"The bigger part of the change is a revert for x86 hosts. Here the
second patch was supposed to fix the first, but in reality it was just
as broken, so both have to go.
x86 host:
- Revert incorrect assumption that cr3 changes come with preempt
notifier callbacks (they don't when static branches are changed,
for example)
ARM host:
- Correctly synchronise PMR and co on PSCI CPU_SUSPEND
- Skip tests that depend on GICv3 when the HW isn't available"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: selftests: aarch64: Skip tests if we can't create a vgic-v3
Revert "KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()"
Revert "KVM: VMX: Save HOST_CR3 in vmx_set_host_fs_gs()"
KVM: arm64: Don't miss pending interrupts for suspended vCPU
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull binfmt_elf fix from Kees Cook:
"This addresses a regression[1] under ia64 where some ET_EXEC binaries
were not loading"
Link: https://linux-regtracking.leemhuis.info/regzbot/regression/a3edd529-c42d-3b09-135c-7e98a15b150f@leemhuis.info/ [1]
- Fix ia64 ET_EXEC loading
* tag 'binfmt_elf-v5.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
binfmt_elf: Avoid total_mapping_size for ET_EXEC
|
|
Partially revert commit 5f501d555653 ("binfmt_elf: reintroduce using
MAP_FIXED_NOREPLACE"), which applied the ET_DYN "total_mapping_size"
logic also to ET_EXEC.
At least ia64 has ET_EXEC PT_LOAD segments that are not virtual-address
contiguous (but _are_ file-offset contiguous). This would result in a
giant mapping attempting to cover the entire span, including the virtual
address range hole, and well beyond the size of the ELF file itself,
causing the kernel to refuse to load it. For example:
$ readelf -lW /usr/bin/gcc
...
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz ...
...
LOAD 0x000000 0x4000000000000000 0x4000000000000000 0x00b5a0 0x00b5a0 ...
LOAD 0x00b5a0 0x600000000000b5a0 0x600000000000b5a0 0x0005ac 0x000710 ...
...
^^^^^^^^ ^^^^^^^^^^^^^^^^^^ ^^^^^^^^ ^^^^^^^^
File offset range : 0x000000-0x00bb4c
0x00bb4c bytes
Virtual address range : 0x4000000000000000-0x600000000000bcb0
0x200000000000bcb0 bytes
Remove the total_mapping_size logic for ET_EXEC, which reduces the
ET_EXEC MAP_FIXED_NOREPLACE coverage to only the first PT_LOAD (better
than nothing), and retains it for ET_DYN.
Ironically, this is the reverse of the problem that originally caused
problems with MAP_FIXED_NOREPLACE: overlapping PT_LOAD segments. Future
work could restore full coverage if load_elf_binary() were to perform
mappings in a separate phase from the loading (where it could resolve
both overlaps and holes).
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Reported-by: matoro <matoro_bugzilla_kernel@matoro.tk>
Fixes: 5f501d555653 ("binfmt_elf: reintroduce using MAP_FIXED_NOREPLACE")
Link: https://lore.kernel.org/r/a3edd529-c42d-3b09-135c-7e98a15b150f@leemhuis.info
Tested-by: matoro <matoro_mailinglist_kernel@matoro.tk>
Link: https://lore.kernel.org/lkml/ce8af9c13bcea9230c7689f3c1e0e2cd@matoro.tk
Tested-By: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Link: https://lore.kernel.org/lkml/49182d0d-708b-4029-da5f-bc18603440a6@physik.fu-berlin.de
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|
Remove non-inclusive language from the iavf driver.
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Currently there are functions in iavf_virtchnl.c for polling specific
virtchnl receive events. These are all assigning iavf_status values to
int values. Fix this and explicitly assign int values if iavf_status
is not IAVF_SUCCESS.
Also, refactor a small amount of duplicated code that can be reused by
all of the previously mentioned functions.
Finally, fix some spacing errors for variable assignment and get rid of
all the goto statements in the refactored functions for clarity.
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Several functions in the iAVF core files take status values of the enum
iavf_status and convert them into integer values. This leads to
confusion as functions return both Linux errno values and status codes
intermixed. Reporting status codes as if they were "errno" values can
lead to confusion when reviewing error logs. Additionally, it can lead
to unexpected behavior if a return value is not interpreted properly.
Fix this by introducing iavf_status_to_errno, a switch that explicitly
converts from the status codes into an appropriate error value. Also
introduce a virtchnl_status_to_errno function for the one case where we
were returning both virtchnl status codes and iavf_status codes in the
same function.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Return value directly instead of taking this in another redundant
variable.
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Minghao Chi <chi.minghao@zte.com.cn>
Signed-off-by: CGEL ZTE <cgel.zte@gmail.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Use new type field of VIRTCHNL_OP_ADD_ETH_ADDR and
VIRTCHNL_OP_DEL_ETH_ADDR requests to indicate that
VF wants to change its default MAC address.
Signed-off-by: Sylwester Dziedziuch <sylwesterx.dziedziuch@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
In order to handle the capability exchange necessary for
VIRTCHNL_VF_OFFLOAD_VLAN_V2, the driver must send
a VIRTCHNL_OP_GET_OFFLOAD_VLAN_V2_CAPS message. This must occur prior to
__IAVF_CONFIG_ADAPTER, and the driver must wait for the response from
the PF.
To handle this, the __IAVF_INIT_GET_OFFLOAD_VLAN_V2_CAPS state was
introduced. This state is intended to process the response from the VLAN
V2 caps message. This works ok, but is difficult to extend to adding
more extended capability exchange.
Existing (and future) AVF features are relying more and more on these
sort of extended ops for processing additional capabilities. Just like
VLAN V2, this exchange must happen prior to __IAVF_CONFIG_ADPATER.
Since we only send one outstanding AQ message at a time during init, it
is not clear where to place this state. Adding more capability specific
states becomes a mess. Instead of having the "previous" state send
a message and then transition into a capability-specific state,
introduce __IAVF_EXTENDED_CAPS state. This state will use a list of
extended_caps that determines what messages to send and receive. As long
as there are extended_caps bits still set, the driver will remain in
this state performing one send or one receive per state machine loop.
Refactor the VLAN V2 negotiation to use this new state, and remove the
capability-specific state. This makes it significantly easier to add
a new similar capability exchange going forward.
Extended capabilities are processed by having an associated SEND and
RECV extended capability bit. During __IAVF_EXTENDED_CAPS, the
driver checks these bits in order by feature, first the send bit for
a feature, then the recv bit for a feature. Each send flag will call
a function that sends the necessary response, while each receive flag
will wait for the response from the PF. If a given feature can't be
negotiated with the PF, the associated flags will be cleared in
order to skip processing of that feature.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Advanced link speed support was added long back, but adding AIM support was
missed. This patch adds AIM support for advanced link speed support, which
allows the algorithm to take into account 50G/100G link speeds. Also, other
previous speeds are taken into consideration when advanced link speeds are
supported.
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
johannes Berg says:
====================
Some last-minute fixes:
* rfkill
- add missing rfill_soft_blocked() when disabled
* cfg80211
- handle a nla_memdup() failure correctly
- fix CONFIG_CFG80211_EXTRA_REGDB_KEYDIR typo in
Makefile
* mac80211
- fix EAPOL handling in 802.3 RX path
- reject setting up aggregation sessions before
connection is authorized to avoid timeouts or
similar
- handle some SAE authentication steps correctly
- fix AC selection in mesh forwarding
* iwlwifi
- remove TWT support as it causes firmware crashes
when the AP isn't behaving correctly
- check debugfs pointer before dereferncing it
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Dust Li says:
====================
net/smc: some datapath performance optimizations
This series tries to improve the performance of SMC in datapath.
- patch #1, add sysctl interface to support tuning the behaviour of
SMC in container environment.
- patch #2/#3, add autocorking support which is very efficient for small
messages without trade-off for latency.
- patch #4, send directly on setting TCP_NODELAY, without wake up the
TX worker, this make it consistent with clearing TCP_CORK.
- patch #5, this correct the setting of RMB window update limit, so
we don't send CDC messages to update peer's RMB window too frequently
in some cases.
- patch #6, implemented something like NAPI in SMC, decrease the number
of hardirq when busy.
- patch #7, this moves TX work doing in the BH to the user context when
sock_lock is hold by user.
With this patchset applied, we can get a good performance gain:
- qperf tcp_bw test has shown a great improvement. Other benchmarks like
'netperf TCP_STREAM' or 'sockperf throughput' has similar result.
- In my testing environment, running qperf tcp_bw and tcp_lat, SMC behaves
better then TCP in most all message size.
Here are some test results with the following testing command:
client: smc_run taskset -c 1 qperf smc-server -oo msg_size:1:64K:*2 \
-t 30 -vu tcp_{bw|lat}
server: smc_run taskset -c 1 qperf
==== Bandwidth ====
MsgSize Origin SMC TCP SMC with patches
1 0.578 MB/s 2.392 MB/s(313.57%) 2.561 MB/s(342.83%)
2 1.159 MB/s 4.780 MB/s(312.53%) 5.162 MB/s(345.46%)
4 2.283 MB/s 10.266 MB/s(349.77%) 10.122 MB/s(343.46%)
8 4.668 MB/s 19.040 MB/s(307.86%) 20.521 MB/s(339.59%)
16 9.147 MB/s 38.904 MB/s(325.31%) 40.823 MB/s(346.29%)
32 18.369 MB/s 79.587 MB/s(333.25%) 80.535 MB/s(338.42%)
64 36.562 MB/s 148.668 MB/s(306.61%) 158.170 MB/s(332.60%)
128 72.961 MB/s 274.913 MB/s(276.80%) 316.217 MB/s(333.41%)
256 144.705 MB/s 512.059 MB/s(253.86%) 626.019 MB/s(332.62%)
512 288.873 MB/s 884.977 MB/s(206.35%) 1221.596 MB/s(322.88%)
1024 574.180 MB/s 1337.736 MB/s(132.98%) 2203.156 MB/s(283.70%)
2048 1095.192 MB/s 1865.952 MB/s( 70.38%) 3036.448 MB/s(177.25%)
4096 2066.157 MB/s 2380.337 MB/s( 15.21%) 3834.271 MB/s( 85.58%)
8192 3717.198 MB/s 2733.073 MB/s(-26.47%) 4904.910 MB/s( 31.95%)
16384 4742.221 MB/s 2958.693 MB/s(-37.61%) 5220.272 MB/s( 10.08%)
32768 5349.550 MB/s 3061.285 MB/s(-42.77%) 5321.865 MB/s( -0.52%)
65536 5162.919 MB/s 3731.408 MB/s(-27.73%) 5245.021 MB/s( 1.59%)
==== Latency ====
MsgSize Origin SMC TCP SMC with patches
1 10.540 us 11.938 us( 13.26%) 10.356 us( -1.75%)
2 10.996 us 11.992 us( 9.06%) 10.073 us( -8.39%)
4 10.229 us 11.687 us( 14.25%) 9.996 us( -2.28%)
8 10.203 us 11.653 us( 14.21%) 10.063 us( -1.37%)
16 10.530 us 11.313 us( 7.44%) 10.013 us( -4.91%)
32 10.241 us 11.586 us( 13.13%) 10.081 us( -1.56%)
64 10.693 us 11.652 us( 8.97%) 9.986 us( -6.61%)
128 10.597 us 11.579 us( 9.27%) 10.262 us( -3.16%)
256 10.409 us 11.957 us( 14.87%) 10.148 us( -2.51%)
512 11.088 us 12.505 us( 12.78%) 10.206 us( -7.95%)
1024 11.240 us 12.255 us( 9.03%) 10.631 us( -5.42%)
2048 11.485 us 16.970 us( 47.76%) 10.981 us( -4.39%)
4096 12.077 us 13.948 us( 15.49%) 11.847 us( -1.90%)
8192 13.683 us 16.693 us( 22.00%) 13.336 us( -2.54%)
16384 16.470 us 23.615 us( 43.38%) 16.519 us( 0.30%)
32768 22.540 us 40.966 us( 81.75%) 22.452 us( -0.39%)
65536 34.192 us 73.003 us(113.51%) 33.916 us( -0.81%)
------------
Test environment notes:
1. Testing is run on 2 VMs within the same physical host
2. The NIC is ConnectX-4Lx, using SRIOV, and passing through 2 VFs to the
2 VMs respectively.
3. To decrease jitter, VM's vCPU are binded to each physical CPU, and those
physical CPUs are all isolated using boot parameter `isolcpus=xxx`
4. The queue number are set to 1, and interrupt from the queue is binded to
CPU0 in the guest
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Send data all the way down to the RDMA device is a time
consuming operation(get a new slot, maybe do RDMA Write
and send a CDC, etc). Moving those operations from BH
to user context is good for performance.
If the sock_lock is hold by user, we don't try to send
data out in the BH context, but just mark we should
send. Since the user will release the sock_lock soon, we
can do the sending there.
Add smc_release_cb() which will be called in release_sock()
and try send in the callback if needed.
This patch moves the sending part out from BH if sock lock
is hold by user. In my testing environment, this saves about
20% softirq in the qperf 4K tcp_bw test in the sender side
with no noticeable throughput drop.
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When we are handling softirq workload, enable hardirq may
again interrupt the current routine of softirq, and then
try to raise softirq again. This only wastes CPU cycles
and won't have any real gain.
Since IB_CQ_REPORT_MISSED_EVENTS already make sure if
ib_req_notify_cq() returns 0, it is safe to wait for the
next event, with no need to poll the CQ again in this case.
This patch disables hardirq during the processing of softirq,
and re-arm the CQ after softirq is done. Somehow like NAPI.
Co-developed-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
rmbe_update_limit is used to limit announcing receive
window updating too frequently. RFC7609 request a minimal
increase in the window size of 10% of the receive buffer
space. But current implementation used:
min_t(int, rmbe_size / 10, SOCK_MIN_SNDBUF / 2)
and SOCK_MIN_SNDBUF / 2 == 2304 Bytes, which is almost
always less then 10% of the receive buffer space.
This causes the receiver always sending CDC message to
update its consumer cursor when it consumes more then 2K
of data. And as a result, we may encounter something like
"TCP silly window syndrome" when sending 2.5~8K message.
This patch fixes this using max(rmbe_size / 10, SOCK_MIN_SNDBUF / 2).
With this patch and SMC autocorking enabled, qperf 2K/4K/8K
tcp_bw test shows 45%/75%/40% increase in throughput respectively.
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In commit ea785a1a573b("net/smc: Send directly when
TCP_CORK is cleared"), we don't use delayed work
to implement cork.
This patch use the same algorithm, removes the
delayed work when setting TCP_NODELAY and send
directly in setsockopt(). This also makes the
TCP_NODELAY the same as TCP.
Cc: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This add a new sysctl: net.smc.autocorking_size
We can dynamically change the behaviour of autocorking
by change the value of autocorking_size.
Setting to 0 disables autocorking in SMC
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch adds autocorking support for SMC which could improve
throughput for small message by x3+.
The main idea is borrowed from TCP autocorking with some RDMA
specific modification:
1. The first message should never cork to make sure we won't
bring extra latency
2. If we have posted any Tx WRs to the NIC that have not
completed, cork the new messages until:
a) Receive CQE for the last Tx WR
b) We have corked enough message on the connection
3. Try to push the corked data out when we receive CQE of
the last Tx WR to prevent the corked messages hang in
the send queue.
Both SMC autocorking and TCP autocorking check the TX completion
to decide whether we should cork or not. The difference is
when we got a SMC Tx WR completion, the data have been confirmed
by the RNIC while TCP TX completion just tells us the data
have been sent out by the local NIC.
Add an atomic variable tx_pushing in smc_connection to make
sure only one can send to let it cork more and save CDC slot.
SMC autocorking should not bring extra latency since the first
message will always been sent out immediately.
The qperf tcp_bw test shows more than x4 increase under small
message size with Mellanox connectX4-Lx, same result with other
throughput benchmarks like sockperf/netperf.
The qperf tcp_lat test shows SMC autocorking has not increase any
ping-pong latency.
Test command:
client: smc_run taskset -c 1 qperf smc-server -oo msg_size:1:64K:*2 \
-t 30 -vu tcp_{bw|lat}
server: smc_run taskset -c 1 qperf
=== Bandwidth ====
MsgSize(Bytes) SMC-NoCork TCP SMC-AutoCorking
1 0.578 MB/s 2.392 MB/s(313.57%) 2.647 MB/s(357.72%)
2 1.159 MB/s 4.780 MB/s(312.53%) 5.153 MB/s(344.71%)
4 2.283 MB/s 10.266 MB/s(349.77%) 10.363 MB/s(354.02%)
8 4.668 MB/s 19.040 MB/s(307.86%) 21.215 MB/s(354.45%)
16 9.147 MB/s 38.904 MB/s(325.31%) 41.740 MB/s(356.32%)
32 18.369 MB/s 79.587 MB/s(333.25%) 82.392 MB/s(348.52%)
64 36.562 MB/s 148.668 MB/s(306.61%) 161.564 MB/s(341.89%)
128 72.961 MB/s 274.913 MB/s(276.80%) 325.363 MB/s(345.94%)
256 144.705 MB/s 512.059 MB/s(253.86%) 633.743 MB/s(337.96%)
512 288.873 MB/s 884.977 MB/s(206.35%) 1250.681 MB/s(332.95%)
1024 574.180 MB/s 1337.736 MB/s(132.98%) 2246.121 MB/s(291.19%)
2048 1095.192 MB/s 1865.952 MB/s( 70.38%) 2057.767 MB/s( 87.89%)
4096 2066.157 MB/s 2380.337 MB/s( 15.21%) 2173.983 MB/s( 5.22%)
8192 3717.198 MB/s 2733.073 MB/s(-26.47%) 3491.223 MB/s( -6.08%)
16384 4742.221 MB/s 2958.693 MB/s(-37.61%) 4637.692 MB/s( -2.20%)
32768 5349.550 MB/s 3061.285 MB/s(-42.77%) 5385.796 MB/s( 0.68%)
65536 5162.919 MB/s 3731.408 MB/s(-27.73%) 5223.890 MB/s( 1.18%)
==== Latency ====
MsgSize(Bytes) SMC-NoCork TCP SMC-AutoCorking
1 10.540 us 11.938 us( 13.26%) 10.573 us( 0.31%)
2 10.996 us 11.992 us( 9.06%) 10.269 us( -6.61%)
4 10.229 us 11.687 us( 14.25%) 10.240 us( 0.11%)
8 10.203 us 11.653 us( 14.21%) 10.402 us( 1.95%)
16 10.530 us 11.313 us( 7.44%) 10.599 us( 0.66%)
32 10.241 us 11.586 us( 13.13%) 10.223 us( -0.18%)
64 10.693 us 11.652 us( 8.97%) 10.251 us( -4.13%)
128 10.597 us 11.579 us( 9.27%) 10.494 us( -0.97%)
256 10.409 us 11.957 us( 14.87%) 10.710 us( 2.89%)
512 11.088 us 12.505 us( 12.78%) 10.547 us( -4.88%)
1024 11.240 us 12.255 us( 9.03%) 10.787 us( -4.03%)
2048 11.485 us 16.970 us( 47.76%) 11.256 us( -1.99%)
4096 12.077 us 13.948 us( 15.49%) 12.230 us( 1.27%)
8192 13.683 us 16.693 us( 22.00%) 13.786 us( 0.75%)
16384 16.470 us 23.615 us( 43.38%) 16.459 us( -0.07%)
32768 22.540 us 40.966 us( 81.75%) 23.284 us( 3.30%)
65536 34.192 us 73.003 us(113.51%) 34.233 us( 0.12%)
With SMC autocorking support, we can archive better throughput
than TCP in most message sizes without any latency trade-off.
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch add sysctl interface to support container environment
for SMC as we talk in the mail list.
Link: https://lore.kernel.org/netdev/20220224020253.GF5443@linux.alibaba.com
Co-developed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The kbuild change here accidentally removed not only the
unquoting, but also the last character of the variable
name. Fix that.
Fixes: 129ab0d2d9f3 ("kbuild: do not quote string values in include/config/auto.conf")
Reviewed-by: Masahiro Yamada <masahiroy@kernel.org>
Link: https://lore.kernel.org/r/20220221155512.1d25895f7c5f.I50fa3d4189fcab90a2896fe8cae215035dae9508@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Update btf_dump case for conflicting names caused by forward declaration.
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20220301053250.1464204-3-xukuohai@huawei.com
|
|
Currently if a declaration appears in the BTF before the definition, the
definition is dumped as a conflicting name, e.g.:
$ bpftool btf dump file vmlinux format raw | grep "'unix_sock'"
[81287] FWD 'unix_sock' fwd_kind=struct
[89336] STRUCT 'unix_sock' size=1024 vlen=14
$ bpftool btf dump file vmlinux format c | grep "struct unix_sock"
struct unix_sock;
struct unix_sock___2 { <--- conflict, the "___2" is unexpected
struct unix_sock___2 *unix_sk;
This causes a compilation error if the dump output is used as a header file.
Fix it by skipping declaration when counting duplicated type names.
Fixes: 351131b51c7a ("libbpf: add btf_dump API for BTF-to-C conversion")
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20220301053250.1464204-2-xukuohai@huawei.com
|
|
In case someone combines bpf socket assign and nf_queue, then we will
queue an skb who references a struct sock that did not have its
reference count incremented.
As we leave rcu protection, there is no guarantee that skb->sk is still
valid.
For refcount-less skb->sk case, try to increment the reference count
and then override the destructor.
In case of failure we have two choices: orphan the skb and 'delete'
preselect or let nf_queue() drop the packet.
Do the latter, it should not happen during normal operation.
Fixes: cf7fbe660f2d ("bpf: Add socket assign support")
Acked-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Florian Westphal <fw@strlen.de>
|
|
Eric Dumazet says:
The sock_hold() side seems suspect, because there is no guarantee
that sk_refcnt is not already 0.
On failure, we cannot queue the packet and need to indicate an
error. The packet will be dropped by the caller.
v2: split skb prefetch hunk into separate change
Fixes: 271b72c7fa82c ("udp: RCU handling for Unicast packets.")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
|
|
causes:
BUG: KASAN: slab-out-of-bounds in sk_free+0x25/0x80
Write of size 4 at addr ffff888106df0284 by task nf-queue/1459
sk_free+0x25/0x80
nf_queue_entry_release_refs+0x143/0x1a0
nf_reinject+0x233/0x770
... without 'netfilter: nf_queue: don't assume sk is full socket'.
Signed-off-by: Florian Westphal <fw@strlen.de>
|
|
There is no guarantee that state->sk refers to a full socket.
If refcount transitions to 0, sock_put calls sk_free which then ends up
with garbage fields.
I'd like to thank Oleksandr Natalenko and Jiri Benc for considerable
debug work and pointing out state->sk oddities.
Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener")
Tested-by: Oleksandr Natalenko <oleksandr@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
|
|
When we get anti-clogging token required (added by the commit
mentioned below), or the other status codes added by the later
commit 4e56cde15f7d ("mac80211: Handle special status codes in
SAE commit") we currently just pretend (towards the internal
state machine of authentication) that we didn't receive anything.
This has the undesirable consequence of retransmitting the prior
frame, which is not expected, because the timer is still armed.
If we just disarm the timer at that point, it would result in
the undesirable side effect of being in this state indefinitely
if userspace crashes, or so.
So to fix this, reset the timer and set a new auth_data->waiting
in order to have no more retransmissions, but to have the data
destroyed when the timer actually fires, which will only happen
if userspace didn't continue (i.e. crashed or abandoned it.)
Fixes: a4055e74a2ff ("mac80211: Don't destroy auth data in case of anti-clogging")
Reported-by: Jouni Malinen <j@w1.fi>
Link: https://lore.kernel.org/r/20220224103932.75964e1d7932.Ia487f91556f29daae734bf61f8181404642e1eec@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
As there's potential for failure of the nla_memdup(),
check the return value.
Fixes: a442b761b24b ("cfg80211: add add_nan_func / del_nan_func")
Signed-off-by: Jiasheng Jiang <jiasheng@iscas.ac.cn>
Link: https://lore.kernel.org/r/20220301100020.3801187-1-jiasheng@iscas.ac.cn
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
When "debugfs=off" is used on the kernel command line, iwiwifi's
mvm module uses an invalid/unchecked debugfs_dir pointer and causes
a BUG:
BUG: kernel NULL pointer dereference, address: 000000000000004f
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 503 Comm: modprobe Tainted: G W 5.17.0-rc5 #7
Hardware name: Dell Inc. Inspiron 15 5510/076F7Y, BIOS 2.4.1 11/05/2021
RIP: 0010:iwl_mvm_dbgfs_register+0x692/0x700 [iwlmvm]
Code: 69 a0 be 80 01 00 00 48 c7 c7 50 73 6a a0 e8 95 cf ee e0 48 8b 83 b0 1e 00 00 48 c7 c2 54 73 6a a0 be 64 00 00 00 48 8d 7d 8c <48> 8b 48 50 e8 15 22 07 e1 48 8b 43 28 48 8d 55 8c 48 c7 c7 5f 73
RSP: 0018:ffffc90000a0ba68 EFLAGS: 00010246
RAX: ffffffffffffffff RBX: ffff88817d6e3328 RCX: ffff88817d6e3328
RDX: ffffffffa06a7354 RSI: 0000000000000064 RDI: ffffc90000a0ba6c
RBP: ffffc90000a0bae0 R08: ffffffff824e4880 R09: ffffffffa069d620
R10: ffffc90000a0ba00 R11: ffffffffffffffff R12: 0000000000000000
R13: ffffc90000a0bb28 R14: ffff88817d6e3328 R15: ffff88817d6e3320
FS: 00007f64dd92d740(0000) GS:ffff88847f640000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000000004f CR3: 000000016fc79001 CR4: 0000000000770ee0
PKRU: 55555554
Call Trace:
<TASK>
? iwl_mvm_mac_setup_register+0xbdc/0xda0 [iwlmvm]
iwl_mvm_start_post_nvm+0x71/0x100 [iwlmvm]
iwl_op_mode_mvm_start+0xab8/0xb30 [iwlmvm]
_iwl_op_mode_start+0x6f/0xd0 [iwlwifi]
iwl_opmode_register+0x6a/0xe0 [iwlwifi]
? 0xffffffffa0231000
iwl_mvm_init+0x35/0x1000 [iwlmvm]
? 0xffffffffa0231000
do_one_initcall+0x5a/0x1b0
? kmem_cache_alloc+0x1e5/0x2f0
? do_init_module+0x1e/0x220
do_init_module+0x48/0x220
load_module+0x2602/0x2bc0
? __kernel_read+0x145/0x2e0
? kernel_read_file+0x229/0x290
__do_sys_finit_module+0xc5/0x130
? __do_sys_finit_module+0xc5/0x130
__x64_sys_finit_module+0x13/0x20
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f64dda564dd
Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1b 29 0f 00 f7 d8 64 89 01 48
RSP: 002b:00007ffdba393f88 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f64dda564dd
RDX: 0000000000000000 RSI: 00005575399e2ab2 RDI: 0000000000000001
RBP: 000055753a91c5e0 R08: 0000000000000000 R09: 0000000000000002
R10: 0000000000000001 R11: 0000000000000246 R12: 00005575399e2ab2
R13: 000055753a91ceb0 R14: 0000000000000000 R15: 000055753a923018
</TASK>
Modules linked in: btintel(+) btmtk bluetooth vfat snd_hda_codec_hdmi fat snd_hda_codec_realtek snd_hda_codec_generic iwlmvm(+) snd_sof_pci_intel_tgl mac80211 snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence soundwire_bus snd_sof_intel_hda snd_sof_pci snd_sof snd_sof_xtensa_dsp snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi snd_soc_core btrfs snd_compress snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec raid6_pq iwlwifi snd_hda_core snd_pcm snd_timer snd soundcore cfg80211 intel_ish_ipc(+) thunderbolt rfkill intel_ishtp ucsi_acpi wmi i2c_hid_acpi i2c_hid evdev
CR2: 000000000000004f
---[ end trace 0000000000000000 ]---
Check the debugfs_dir pointer for an error before using it.
Fixes: 8c082a99edb9 ("iwlwifi: mvm: simplify iwl_mvm_dbgfs_register")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Luca Coelho <luciano.coelho@intel.com>
Cc: linux-wireless@vger.kernel.org
Cc: Kalle Valo <kvalo@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20220223030630.23241-1-rdunlap@infradead.org
[change to make both conditional]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Some APs misbehave when TWT is used and cause our firmware to crash.
We don't know a reasonable way to detect and work around this problem
in the FW yet. To prevent these crashes, disable TWT in the driver by
stopping to advertise TWT support.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215523
Signed-off-by: Golan Ben Ami <golan.ben.ami@intel.com>
[reworded the commit message]
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/20220301072926.153969-1-luca@coelho.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
If CONFIG_RFKILL is not set, the Intel WiFi driver will not build
the iw_mvm driver part due to the missing rfill_soft_blocked()
call. Adding a inline declaration of rfill_soft_blocked() if
CONFIG_RFKILL=n fixes the following error:
drivers/net/wireless/intel/iwlwifi/mvm/mvm.h: In function 'iwl_mvm_mei_set_sw_rfkill_state':
drivers/net/wireless/intel/iwlwifi/mvm/mvm.h:2215:38: error: implicit declaration of function 'rfkill_soft_blocked'; did you mean 'rfkill_blocked'? [-Werror=implicit-function-declaration]
2215 | mvm->hw_registered ? rfkill_soft_blocked(mvm->hw->wiphy->rfkill) : false;
| ^~~~~~~~~~~~~~~~~~~
| rfkill_blocked
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Reported-by: Neill Whillans <neill.whillans@codethink.co.uk>
Fixes: 5bc9a9dd7535 ("rfkill: allow to get the software rfkill state")
Link: https://lore.kernel.org/r/20220218093858.1245677-1-ben.dooks@codethink.co.uk
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
It was the intention to reverse the bits, not make them all zero by
using logical NOT operator.
Fixes: cc19db8b312a ("MIPS: ralink: mt7621: do memory detection on KSEG1")
Suggested-by: Chuanhong Guo <gch981213@gmail.com>
Signed-off-by: Ilya Lipnitskiy <ilya.lipnitskiy@gmail.com>
Reviewed-by: Sergio Paracuellos <sergio.paracuellos@gmail.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
|
|
Roopa Prabhu says:
====================
vxlan metadata device vnifiltering support
This series adds vnifiltering support to vxlan collect metadata device.
Motivation:
You can only use a single vxlan collect metadata device for a given
vxlan udp port in the system today. The vxlan collect metadata device
terminates all received vxlan packets. As shown in the below diagram,
there are use-cases where you need to support multiple such vxlan devices in
independent bridge domains. Each vxlan device must terminate the vni's
it is configured for.
Example usecase: In a service provider network a service provider
typically supports multiple bridge domains with overlapping vlans.
One bridge domain per customer. Vlans in each bridge domain are
mapped to globally unique vxlan ranges assigned to each customer.
This series adds vnifiltering support to collect metadata devices to
terminate only configured vnis. This is similar to vlan filtering in
bridge driver. The vni filtering capability is provided by a new flag on
collect metadata device.
In the below pic:
- customer1 is mapped to br1 bridge domain
- customer2 is mapped to br2 bridge domain
- customer1 vlan 10-11 is mapped to vni 1001-1002
- customer2 vlan 10-11 is mapped to vni 2001-2002
- br1 and br2 are vlan filtering bridges
- vxlan1 and vxlan2 are collect metadata devices with
vnifiltering enabled
┌──────────────────────────────────────────────────────────────────┐
│ switch │
│ │
│ ┌───────────┐ ┌───────────┐ │
│ │ │ │ │ │
│ │ br1 │ │ br2 │ │
│ └┬─────────┬┘ └──┬───────┬┘ │
│ vlans│ │ vlans │ │ │
│ 10,11│ │ 10,11│ │ │
│ │ vlanvnimap: │ vlanvnimap: │
│ │ 10-1001,11-1002 │ 10-2001,11-2002 │
│ │ │ │ │ │
│ ┌──────┴┐ ┌──┴─────────┐ ┌───┴────┐ │ │
│ │ swp1 │ │vxlan1 │ │ swp2 │ ┌┴─────────────┐ │
│ │ │ │ vnifilter:│ │ │ │vxlan2 │ │
│ └───┬───┘ │ 1001,1002│ └───┬────┘ │ vnifilter: │ │
│ │ └────────────┘ │ │ 2001,2002 │ │
│ │ │ └──────────────┘ │
│ │ │ │
└───────┼──────────────────────────────────┼───────────────────────┘
│ │
│ │
┌─────┴───────┐ │
│ customer1 │ ┌─────┴──────┐
│ host/VM │ │customer2 │
└─────────────┘ │ host/VM │
└────────────┘
v2:
- remove stale xstats declarations pointed out by Nikolay Aleksandrov
- squash selinux patch with the tunnel api patch as pointed out by
benjamin poirier
- Fix various build issues:
Reported-by: kernel test robot <lkp@intel.com>
v3:
- incorporate review feedback from Jakub
- move rhashtable declarations to c file
- define and use netlink policy for top level vxlan filter api
- fix unused stats function warning
- pass vninode from vnifilter lookup into stats count function
to avoid another lookup (only applicable to vxlan_rcv)
- fix missing vxlan vni delete notifications in vnifilter uninit
function
- misc cleanups
- remote dev check for multicast groups added via vnifiltering api
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add support for VXLAN vni filter entries' stats dumping
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add per-vni statistics for vni filter mode. Counting Rx/Tx
bytes/packets/drops/errors at the appropriate places.
This patch changes vxlan_vs_find_vni to also return the
vxlan_vni_node in cases where the vni belongs to a vni
filtering vxlan device
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch adds a new test script test_vxlan_vnifiltering.sh
with tests for vni filtering api, various datapath tests.
Also has a test with a mix of traditional, metadata and vni
filtering devices inuse at the same time.
Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch adds vnifiltering support to collect metadata device.
Motivation:
You can only use a single vxlan collect metadata device for a given
vxlan udp port in the system today. The vxlan collect metadata device
terminates all received vxlan packets. As shown in the below diagram,
there are use-cases where you need to support multiple such vxlan devices in
independent bridge domains. Each vxlan device must terminate the vni's
it is configured for.
Example usecase: In a service provider network a service provider
typically supports multiple bridge domains with overlapping vlans.
One bridge domain per customer. Vlans in each bridge domain are
mapped to globally unique vxlan ranges assigned to each customer.
vnifiltering support in collect metadata devices terminates only configured
vnis. This is similar to vlan filtering in bridge driver. The vni filtering
capability is provided by a new flag on collect metadata device.
In the below pic:
- customer1 is mapped to br1 bridge domain
- customer2 is mapped to br2 bridge domain
- customer1 vlan 10-11 is mapped to vni 1001-1002
- customer2 vlan 10-11 is mapped to vni 2001-2002
- br1 and br2 are vlan filtering bridges
- vxlan1 and vxlan2 are collect metadata devices with
vnifiltering enabled
┌──────────────────────────────────────────────────────────────────┐
│ switch │
│ │
│ ┌───────────┐ ┌───────────┐ │
│ │ │ │ │ │
│ │ br1 │ │ br2 │ │
│ └┬─────────┬┘ └──┬───────┬┘ │
│ vlans│ │ vlans │ │ │
│ 10,11│ │ 10,11│ │ │
│ │ vlanvnimap: │ vlanvnimap: │
│ │ 10-1001,11-1002 │ 10-2001,11-2002 │
│ │ │ │ │ │
│ ┌──────┴┐ ┌──┴─────────┐ ┌───┴────┐ │ │
│ │ swp1 │ │vxlan1 │ │ swp2 │ ┌┴─────────────┐ │
│ │ │ │ vnifilter:│ │ │ │vxlan2 │ │
│ └───┬───┘ │ 1001,1002│ └───┬────┘ │ vnifilter: │ │
│ │ └────────────┘ │ │ 2001,2002 │ │
│ │ │ └──────────────┘ │
│ │ │ │
└───────┼──────────────────────────────────┼───────────────────────┘
│ │
│ │
┌─────┴───────┐ │
│ customer1 │ ┌─────┴──────┐
│ host/VM │ │customer2 │
└─────────────┘ │ host/VM │
└────────────┘
With this implementation, vxlan dst metadata device can
be associated with range of vnis.
struct vxlan_vni_node is introduced to represent
a configured vni. We start with vni and its
associated remote_ip in this structure. This
structure can be extended to bring in other
per vni attributes if there are usecases for it.
A vni inherits an attribute from the base vxlan device
if there is no per vni attributes defined.
struct vxlan_dev gets a new rhashtable for
vnis called vxlan_vni_group. vxlan_vnifilter.c
implements the necessary netlink api, notifications
and helper functions to process and manage lifecycle
of vxlan_vni_node.
This patch also adds new helper functions in vxlan_multicast.c
to handle per vni remote_ip multicast groups which are part
of vxlan_vni_group.
Fix build problems:
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
subsequent patches will add more helpers.
Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch adds new rtm tunnel msg and api for tunnel id
filtering in dst_metadata devices. First dst_metadata
device to use the api is vxlan driver with AF_BRIDGE
family.
This and later changes add ability in vxlan driver to do
tunnel id filtering (or vni filtering) on dst_metadata
devices. This is similar to vlan api in the vlan filtering bridge.
this patch includes selinux nlmsg_route_perms support for RTM_*TUNNEL
api from Benjamin Poirier.
Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
more users in follow up patches
Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch changes multicast helpers to take rip and ifindex as input.
This is needed in future patches where rip can come from a pervni
structure while the ifindex can come from the vxlan device.
Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|