Age | Commit message (Collapse) | Author |
|
Re-code the pause in-band advertisement update in light of the addition
of PCS support, so that we perform the minimum required; only the PCS
configuration function needs to be called in this case, followed by the
request to trigger a restart of negotiation if the programmed
advertisement changed.
We need to change the pcs_config() signature to pass whether resolved
pause should be passed to the MAC for setups such as mvneta and mvpp2
where doing so overrides the MAC manual flow controls.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
For fixed links, we only allow the current settings, so this should be
a matter of merely rejecting an attempt to change the settings. If the
settings agree, then there is nothing more we need to do.
Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Rather than recomputing whether AN is enabled, use config.an_enabled.
Suggested-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When we have a PHY attached, an ethtool ksettings_set() call only
really needs to call through to the phylib equivalent; phylib will
call back to us when the link changes so we can update our state.
Therefore, we can bypass most of our ksettings_set() call for this
case.
Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Simplify the ksettings_set() implementation to look more like phylib's
implementation; use a switch() for validating the autoneg setting, and
use the linkmode_modify() helper to set the autoneg bit in the
advertisement mask.
Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Avoid calling mac_config() when using split PCS, and the interface
remains the same.
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The only PHYs that are used with phylink which change their interface
are the BCM84881 and MV88X3310 family, both of which only change their
interface modes on link-up events. This will break when drivers are
converted to split-PCS. Fix this.
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The only PHYs that are used with phylink which change their interface
are the BCM84881 and MV88X3310 family, both of which only change their
interface modes on link-up events. However, rather than relying upon
this behaviour by the PHY, we should give a stronger guarantee when
resolving that the link will be down whenever we change the interface
mode. This patch implements that stronger guarantee for resolve.
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Use a boolean to indicate whether mac_config() should be called during
a resolution. This allows resolution to have a single location where
mac_config() will be called, which will allow us to make decisions
about how and what we do.
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Rejig the link state tracking, so that we can use the current state
in a future patch.
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Comparing the ethtool output from phylink and non-phylink fixed-link
setups shows that we have some differences:
- The "auto-negotiation" fields are different; phylink reports these
as "No", non-phylink reports these as "Yes" for the supported and
advertising masks.
- The link partner advertisement is set to the link speed with non-
phylink, but phylink leaves this unset, causing all link partner
fields to be omitted.
The phylink ethtool output also disagrees with the software emulated
PHY dump via the MII registers.
Update the phylink fixed-link parsing code so that we better reflect
the behaviour of the non-phylink code that this facility replaces, and
bring the ethtool interface more into line with the report from via the
MII interface.
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We can't use IS_UDPLITE to replace udp_sk->pcflag when UDPLITE_RECV_CC is
checked.
Fixes: b2bf1e2659b1 ("[UDP]: Clean up for IS_UDPLITE macro")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Claudiu Manoil says:
====================
enetc: Add adaptive interrupt coalescing
Apart from some related cleanup patches, this set
introduces in a straightforward way the support needed
to enable and configure interrupt coalescing for ENETC.
Patch 5 introduces the support needed for configuring the
interrupt coalescing parameters and for switching between
moderated (int. coalescing) and per-packet interrupt modes.
When interrupt coalescing is enabled the Rx/Tx time
thresholds are configurable, packet thresholds are fixed.
To make this work reliably, patch 5 uses the traffic
pause procedure introduced in patch 2.
Patch 6 adds DIM (Dynamic Interrupt Moderation) to implement
adaptive coalescing based on time thresholds, for the Rx 'channel'.
On the Tx side a default optimal value is used instead, optimized for
TCP traffic over 1G and 2.5G links. This default 'optimal' value can
be overridden anytime via 'ethtool -C tx-usecs'.
netperf -t TCP_MAERTS measurements show a significant CPU load
reduction correlated w/ reduced interrupt rates. For the
measurement results refer to the comments in patch 6.
v2: Replaced Tx DIM with predefined optimal value, giving
better results. This was also suggested by Jakub (cc).
Switched order of patches 4 and 5, for better grouping.
v3: minor cleanup/improvements
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Use the generic dynamic interrupt moderation (dim)
framework to implement adaptive interrupt coalescing
on Rx. With the per-packet interrupt scheme, a high
interrupt rate has been noted for moderate traffic flows
leading to high CPU utilization. The 'dim' scheme
implemented by the current patch addresses this issue
improving CPU utilization while using minimal coalescing
time thresholds in order to preserve a good latency.
On the Tx side use an optimal time threshold value by
default. This value has been optimized for Tx TCP
streams at a rate of around 85kpps on a 1G link,
at which rate half of the Tx ring size (128) gets filled
in 1500 usecs. Scaling this down to 2.5G links yields
the current value of 600 usecs, which is conservative
and gives good enough results for 1G links too (see
next).
Below are some measurement results for before and after
this patch (and related dependencies) basically, for a
2 ARM Cortex-A72 @1.3Ghz CPUs system (32 KB L1 data cache),
using 60secs log netperf TCP stream tests @ 1Gbit link
(maximum throughput):
1) 1 Rx TCP flow, both Rx and Tx processed by the same NAPI
thread on the same CPU:
CPU utilization int rate (ints/sec)
Before: 50%-60% (over 50%) 92k
After: 13%-22% 3.5k-12k
Comment: Major CPU utilization improvement for a single flow
Rx TCP flow (i.e. netperf -t TCP_MAERTS) on a single
CPU. Usually settles under 16% for longer tests.
2) 4 Rx TCP flows + 4 Tx TCP flows (+ pings to check the latency):
Total CPU utilization Total int rate (ints/sec)
Before: ~80% (spikes to 90%) ~100k
After: 60% (more steady) ~4k
Comment: Important improvement for this load test, while the
ping test outcome does not show any notable
difference compared to before.
Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Enable programming of the interrupt coalescing registers
and allow manual configuration of the coalescing time
thresholds via ethtool. Packet thresholds have been fixed
to predetermined values as there's no point in making them
run-time configurable, also anticipating the dynamic interrupt
moderation (DIM) algorithm which uses fixed packet thresholds
as well. If the interface is up when the operation mode of
traffic interrupt events is changed by the user (i.e. switching
from default per-packet interrupts to coalesced interrupts),
the traffic needs to be paused in the process.
This patch also prepares the ground for introducing DIM on Rx.
Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
'struct enetc_bdr' is already '____cacheline_aligned_in_smp'.
Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Interrupt coalescing registers naming in the current revision
of the Ref Man (RM) is ICR, deprecating the ICIR name used
in earlier (draft) versions of the RM.
Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
A reliable traffic pause (and reconfiguration) procedure
is needed to be able to safely make h/w configuration
changes during run-time, like changing the mode in which the
interrupts are operating (i.e. with or without coalescing),
as opposed to making on-the-fly register updates that
may be subject to h/w or s/w concurrency issues.
To this end, the code responsible of the run-time device
configurations that basically starts resp. stops the traffic
flow through the device has been extracted from the
the enetc_open/_close procedures, to the separate standalone
enetc_start/_stop procedures. Traffic stop should be as
graceful as possible, it lets the executing napi threads to
to finish while the interrupts stay disabled. But since
the napi thread will try to re-enable interrupts by clearing
the device's unmask register, the enable_irq/ disable_irq
API has been used to avoid this potential concurrency issue
and make the traffic pause procedure more reliable.
Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
It's time to differentiate between Rx and Tx ring sizes.
Not only Tx rings are processed differently than Rx rings,
but their default number also differs - i.e. up to 8 Tx rings
per device (8 traffic classes) vs. 2 Rx rings (one per CPU).
So let's set Tx rings sizes to half the size of the Rx rings
for now, to be conservative.
The default ring sizes were decreased as well (to the next
lower power of 2), to reduce the memory footprint, buffering
etc., since the measurements I've made so far show that the
rings are very unlikely to get full.
This change also anticipates the introduction of the
dynamic interrupt moderation (dim) algorithm which operates
on maximum packet thresholds of 256 packets for Rx and 128
packets for Tx.
Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When I cat 'tx_timeout' by sysfs, it displays as follows. It's better to
add a newline for easy reading.
root@syzkaller:~# cat /sys/devices/virtual/net/lo/queues/tx-0/tx_timeout
0root@syzkaller:~#
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Use devm_gpiod_get_array() to simplify the error handling and exit
code path.
Signed-off-by: Jisheng Zhang <Jisheng.Zhang@synaptics.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
According to the report of [1], this driver is possible to cause
the following error in ravb_tx_timeout_work().
ravb e6800000.ethernet ethernet: failed to switch device to config mode
This error means that the hardware could not change the state
from "Operation" to "Configuration" while some tx and/or rx queue
are operating. After that, ravb_config() in ravb_dmac_init() will fail,
and then any descriptors will be not allocaled anymore so that NULL
pointer dereference happens after that on ravb_start_xmit().
To fix the issue, the ravb_tx_timeout_work() should check
the return values of ravb_stop_dma() and ravb_dmac_init().
If ravb_stop_dma() fails, ravb_tx_timeout_work() re-enables TX and RX
and just exits. If ravb_dmac_init() fails, just exits.
[1]
https://lore.kernel.org/linux-renesas-soc/20200518045452.2390-1-dirk.behme@de.bosch.com/
Reported-by: Dirk Behme <dirk.behme@de.bosch.com>
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Reviewed-by: Sergei Shtylyov <sergei.shtylyov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Kuniyuki Iwashima says:
====================
udp: Fix reuseport selection with connected sockets.
This patch set addresses two issues which happen when both connected and
unconnected sockets are in the same UDP reuseport group.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently, SO_REUSEPORT does not work well if connected sockets are in a
UDP reuseport group.
Then reuseport_has_conns() returns true and the result of
reuseport_select_sock() is discarded. Also, unconnected sockets have the
same score, hence only does the first unconnected socket in udp_hslot
always receive all packets sent to unconnected sockets.
So, the result of reuseport_select_sock() should be used for load
balancing.
The noteworthy point is that the unconnected sockets placed after
connected sockets in sock_reuseport.socks will receive more packets than
others because of the algorithm in reuseport_select_sock().
index | connected | reciprocal_scale | result
---------------------------------------------
0 | no | 20% | 40%
1 | no | 20% | 20%
2 | yes | 20% | 0%
3 | no | 20% | 40%
4 | yes | 20% | 0%
If most of the sockets are connected, this can be a problem, but it still
works better than now.
Fixes: acdcecc61285 ("udp: correct reuseport selection with connected sockets")
CC: Willem de Bruijn <willemb@google.com>
Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If an unconnected socket in a UDP reuseport group connect()s, has_conns is
set to 1. Then, when a packet is received, udp[46]_lib_lookup2() scans all
sockets in udp_hslot looking for the connected socket with the highest
score.
However, when the number of sockets bound to the port exceeds max_socks,
reuseport_grow() resets has_conns to 0. It can cause udp[46]_lib_lookup2()
to return without scanning all sockets, resulting in that packets sent to
connected sockets may be distributed to unconnected sockets.
Therefore, reuseport_grow() should copy has_conns.
Fixes: acdcecc61285 ("udp: correct reuseport selection with connected sockets")
CC: Willem de Bruijn <willemb@google.com>
Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The bpftool sources include code to walk file trees, but use multiple
frameworks to do so: nftw and fts. While nftw conforms to POSIX/SUSv3 and
is widely available, fts is not conformant and less common, especially on
non-glibc systems. The inconsistent framework usage hampers maintenance
and portability of bpftool, in particular for embedded systems.
Standardize code usage by rewriting one fts-based function to use nftw and
clean up some related function warnings by extending use of "const char *"
arguments. This change helps in building bpftool against musl for OpenWrt.
Also fix an unsafe call to dirname() by duplicating the string to pass,
since some implementations may directly alter it. The same approach is
used in libbpf.c.
Signed-off-by: Tony Ambardar <Tony.Ambardar@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20200721024817.13701-1-Tony.Ambardar@gmail.com
|
|
Yonghong Song says:
====================
Commit 5a2798ab32ba
("bpf: Add BTF_ID_LIST/BTF_ID/BTF_ID_UNUSED macros")
implemented a mechanism to compute btf_ids at kernel build
time which can simplify kernel implementation and reduce
runtime overhead by removing in-kernel btf_id calculation.
This patch set tried to use this mechanism to compute
btf_ids for bpf_skc_to_*() helpers and for btf_id_or_null ctx
arguments specified during bpf iterator registration.
Please see individual patch for details.
Changelogs:
v1 -> v2:
- v1 ([1]) is only for bpf_skc_to_*() helpers. This version
expanded it to cover ctx btf_id_or_null arguments
- abandoned the change of "extern u32 name[]" to
"static u32 name[]" for BPF_ID_LIST local "name" definition.
gcc 9 incurred a compilation error.
[1]: https://lore.kernel.org/bpf/20200717184706.3476992-1-yhs@fb.com/T
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
A handful of samples and selftests fail to build on s390, because
after commit 0ebeea8ca8a4 ("bpf: Restrict bpf_probe_read{, str}()
only to archs where they work") bpf_probe_read is not available
anymore.
Fix by using bpf_probe_read_kernel.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200720114806.88823-1-iii@linux.ibm.com
|
|
One additional field btf_id is added to struct
bpf_ctx_arg_aux to store the precomputed btf_ids.
The btf_id is computed at build time with
BTF_ID_LIST or BTF_ID_LIST_GLOBAL macro definitions.
All existing bpf iterators are changed to used
pre-compute btf_ids.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200720163403.1393551-1-yhs@fb.com
|
|
OpenBSD netcat (Debian patchlevel 1.195-2) does not seem to react to
SIGINT for whatever reason, causing prefix.pl to hang after
test_lwt_seg6local.sh exits due to netcat inheriting
test_lwt_seg6local.sh's file descriptors.
Fix by using SIGTERM instead.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200720101810.84299-1-iii@linux.ibm.com
|
|
tcp and udp bpf_iter can reuse some socket ids in
btf_sock_ids, so make it global.
I put the extern definition in btf_ids.h as a central
place so it can be easily discovered by developers.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200720163402.1393427-1-yhs@fb.com
|
|
Luke Nelson says:
====================
his patch series enables using compressed riscv (RVC) instructions
in the rv64 BPF JIT.
RVC is a standard riscv extension that adds a set of compressed,
2-byte instructions that can replace some regular 4-byte instructions
for improved code density.
This series first modifies the JIT to support using 2-byte instructions
(e.g., in jump offset computations), then adds RVC encoding and
helper functions, and finally uses the helper functions to optimize
the rv64 JIT.
I used our formal verification framework, Serval, to verify the
correctness of the RVC encodings and their uses in the rv64 JIT.
The JIT continues to pass all tests in lib/test_bpf.c, and introduces
no new failures to test_verifier; both with and without RVC being enabled.
The following are examples of the JITed code for the verifier selftest
"direct packet read test#3 for CGROUP_SKB OK", without and with RVC
enabled, respectively. The former uses 178 bytes, and the latter uses 112,
for a ~37% reduction in code size for this example.
Without RVC:
0: 02000813 addi a6,zero,32
4: fd010113 addi sp,sp,-48
8: 02813423 sd s0,40(sp)
c: 02913023 sd s1,32(sp)
10: 01213c23 sd s2,24(sp)
14: 01313823 sd s3,16(sp)
18: 01413423 sd s4,8(sp)
1c: 03010413 addi s0,sp,48
20: 03056683 lwu a3,48(a0)
24: 02069693 slli a3,a3,0x20
28: 0206d693 srli a3,a3,0x20
2c: 03456703 lwu a4,52(a0)
30: 02071713 slli a4,a4,0x20
34: 02075713 srli a4,a4,0x20
38: 03856483 lwu s1,56(a0)
3c: 02049493 slli s1,s1,0x20
40: 0204d493 srli s1,s1,0x20
44: 03c56903 lwu s2,60(a0)
48: 02091913 slli s2,s2,0x20
4c: 02095913 srli s2,s2,0x20
50: 04056983 lwu s3,64(a0)
54: 02099993 slli s3,s3,0x20
58: 0209d993 srli s3,s3,0x20
5c: 09056a03 lwu s4,144(a0)
60: 020a1a13 slli s4,s4,0x20
64: 020a5a13 srli s4,s4,0x20
68: 00900313 addi t1,zero,9
6c: 006a7463 bgeu s4,t1,0x74
70: 00000a13 addi s4,zero,0
74: 02d52823 sw a3,48(a0)
78: 02e52a23 sw a4,52(a0)
7c: 02952c23 sw s1,56(a0)
80: 03252e23 sw s2,60(a0)
84: 05352023 sw s3,64(a0)
88: 00000793 addi a5,zero,0
8c: 02813403 ld s0,40(sp)
90: 02013483 ld s1,32(sp)
94: 01813903 ld s2,24(sp)
98: 01013983 ld s3,16(sp)
9c: 00813a03 ld s4,8(sp)
a0: 03010113 addi sp,sp,48
a4: 00078513 addi a0,a5,0
a8: 00008067 jalr zero,0(ra)
With RVC:
0: 02000813 addi a6,zero,32
4: 7179 c.addi16sp sp,-48
6: f422 c.sdsp s0,40(sp)
8: f026 c.sdsp s1,32(sp)
a: ec4a c.sdsp s2,24(sp)
c: e84e c.sdsp s3,16(sp)
e: e452 c.sdsp s4,8(sp)
10: 1800 c.addi4spn s0,sp,48
12: 03056683 lwu a3,48(a0)
16: 1682 c.slli a3,0x20
18: 9281 c.srli a3,0x20
1a: 03456703 lwu a4,52(a0)
1e: 1702 c.slli a4,0x20
20: 9301 c.srli a4,0x20
22: 03856483 lwu s1,56(a0)
26: 1482 c.slli s1,0x20
28: 9081 c.srli s1,0x20
2a: 03c56903 lwu s2,60(a0)
2e: 1902 c.slli s2,0x20
30: 02095913 srli s2,s2,0x20
34: 04056983 lwu s3,64(a0)
38: 1982 c.slli s3,0x20
3a: 0209d993 srli s3,s3,0x20
3e: 09056a03 lwu s4,144(a0)
42: 1a02 c.slli s4,0x20
44: 020a5a13 srli s4,s4,0x20
48: 4325 c.li t1,9
4a: 006a7363 bgeu s4,t1,0x50
4e: 4a01 c.li s4,0
50: d914 c.sw a3,48(a0)
52: d958 c.sw a4,52(a0)
54: dd04 c.sw s1,56(a0)
56: 03252e23 sw s2,60(a0)
5a: 05352023 sw s3,64(a0)
5e: 4781 c.li a5,0
60: 7422 c.ldsp s0,40(sp)
62: 7482 c.ldsp s1,32(sp)
64: 6962 c.ldsp s2,24(sp)
66: 69c2 c.ldsp s3,16(sp)
68: 6a22 c.ldsp s4,8(sp)
6a: 6145 c.addi16sp sp,48
6c: 853e c.mv a0,a5
6e: 8082 c.jr ra
RFC -> v1:
- From Björn Töpel:
* Changed RVOFF macro to static inline "ninsns_rvoff".
* Changed return type of rvc_ functions from u32 to u16.
* Changed sizeof(u16) to sizeof(*ctx->insns).
* Factored unsigned immediate checks into helper functions
(is_8b_uint, etc.)
* Changed to use IS_ENABLED instead of #ifdef to check if RVC is
enabled.
* Changed type of immediate arguments to rvc_* encoding to u32
to avoid issues from promotion of u16 to signed int.
* Cleaned up RVC checks in emit_{addi,slli,srli,srai}.
+ Wrapped lines at 100 instead of 80 columns for increased clarity.
+ Move !imm checks into each branch instead of checking
separately.
+ Strengthed checks for c.{slli,srli,srai} to check that
imm < XLEN. Otherwise, imm could be non-zero but the lower
XLEN bits could all be zero, leading to invalid RVC encoding.
* Changed emit_imm to sign-extend the 12-bit value in "lower"
+ The immediate checks for emit_{addiw,li,addi} use signed
comparisons, so this enables the RVC variants to be used
more often (e.g., if val == -1, then lower should be -1
as opposed to 4095).
====================
Reviewed-by: Björn Töpel <bjorn.topel@gmail.com>
Acked-by: Björn Töpel <bjorn.topel@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Existing BTF_ID_LIST used a local static variable
to store btf_ids. This patch provided a new macro
BTF_ID_LIST_GLOBAL to store btf_ids in a global
variable which can be shared among multiple files.
The existing BTF_ID_LIST is still retained.
Two reasons. First, BTF_ID_LIST is also used to build
btf_ids for helper arguments which typically
is an array of 5. Since typically different
helpers have different signature, it makes
little sense to share them. Second, some
current computed btf_ids are indeed local.
If later those btf_ids are shared between
different files, they can use BTF_ID_LIST_GLOBAL then.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Link: https://lore.kernel.org/bpf/20200720163401.1393159-1-yhs@fb.com
|
|
Sync kernel header btf_ids.h to tools directory.
Also define macro CONFIG_DEBUG_INFO_BTF before
including btf_ids.h in prog_tests/resolve_btfids.c
since non-stub definitions for BTF_ID_LIST etc. macros
are defined under CONFIG_DEBUG_INFO_BTF. This
prevented test_progs from failing.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200720163359.1393079-1-yhs@fb.com
|
|
Currently, socket types (struct tcp_sock, udp_sock, etc.)
used by bpf_skc_to_*() helpers are computed when vmlinux_btf
is first built in the kernel.
Commit 5a2798ab32ba
("bpf: Add BTF_ID_LIST/BTF_ID/BTF_ID_UNUSED macros")
implemented a mechanism to compute btf_ids at kernel build
time which can simplify kernel implementation and reduce
runtime overhead by removing in-kernel btf_id calculation.
This patch did exactly this, removing in-kernel btf_id
computation and utilizing build-time btf_id computation.
If CONFIG_DEBUG_INFO_BTF is not defined, BTF_ID_LIST will
define an array with size of 5, which is not enough for
btf_sock_ids. So define its own static array if
CONFIG_DEBUG_INFO_BTF is not defined.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200720163358.1393023-1-yhs@fb.com
|
|
Fix pass 0 to PTR_ERR, also dump more err info using
libbpf_strerror.
Fixes: 5dc7a8b21144 ("bpftool, selftests/bpf: Embed object file inside skeleton")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20200717123059.29624-1-yuehaibing@huawei.com
|
|
This patch uses the RVC support and encodings from bpf_jit.h to optimize
the rv64 jit.
The optimizations work by replacing emit(rv_X(...)) with a call to a
helper function emit_X, which will emit a compressed version of the
instruction when possible, and when RVC is enabled.
The JIT continues to pass all tests in lib/test_bpf.c, and introduces
no new failures to test_verifier; both with and without RVC being enabled.
Most changes are straightforward replacements of emit(rv_X(...), ctx)
with emit_X(..., ctx), with the following exceptions bearing mention;
* Change emit_imm to sign-extend the value in "lower", since the
checks for RVC (and the instructions themselves) treat the value as
signed. Otherwise, small negative immediates will not be recognized as
encodable using an RVC instruction. For example, without this change,
emit_imm(rd, -1, ctx) would cause lower to become 4095, which is not a
6b int even though a "c.li rd, -1" instruction suffices.
* For {BPF_MOV,BPF_ADD} BPF_X, drop using addiw,addw in the 32-bit
cases since the values are zero-extended into the upper 32 bits in
the following instructions anyways, and the addition commutes with
zero-extension. (BPF_SUB BPF_X must still use subw since subtraction
does not commute with zero-extension.)
This patch avoids optimizing branches and jumps to use RVC instructions
since surrounding code often makes assumptions about the sizes of
emitted instructions. Optimizing these will require changing these
functions (e.g., emit_branch) to dynamically compute jump offsets.
The following are examples of the JITed code for the verifier selftest
"direct packet read test#3 for CGROUP_SKB OK", without and with RVC
enabled, respectively. The former uses 178 bytes, and the latter uses 112,
for a ~37% reduction in code size for this example.
Without RVC:
0: 02000813 addi a6,zero,32
4: fd010113 addi sp,sp,-48
8: 02813423 sd s0,40(sp)
c: 02913023 sd s1,32(sp)
10: 01213c23 sd s2,24(sp)
14: 01313823 sd s3,16(sp)
18: 01413423 sd s4,8(sp)
1c: 03010413 addi s0,sp,48
20: 03056683 lwu a3,48(a0)
24: 02069693 slli a3,a3,0x20
28: 0206d693 srli a3,a3,0x20
2c: 03456703 lwu a4,52(a0)
30: 02071713 slli a4,a4,0x20
34: 02075713 srli a4,a4,0x20
38: 03856483 lwu s1,56(a0)
3c: 02049493 slli s1,s1,0x20
40: 0204d493 srli s1,s1,0x20
44: 03c56903 lwu s2,60(a0)
48: 02091913 slli s2,s2,0x20
4c: 02095913 srli s2,s2,0x20
50: 04056983 lwu s3,64(a0)
54: 02099993 slli s3,s3,0x20
58: 0209d993 srli s3,s3,0x20
5c: 09056a03 lwu s4,144(a0)
60: 020a1a13 slli s4,s4,0x20
64: 020a5a13 srli s4,s4,0x20
68: 00900313 addi t1,zero,9
6c: 006a7463 bgeu s4,t1,0x74
70: 00000a13 addi s4,zero,0
74: 02d52823 sw a3,48(a0)
78: 02e52a23 sw a4,52(a0)
7c: 02952c23 sw s1,56(a0)
80: 03252e23 sw s2,60(a0)
84: 05352023 sw s3,64(a0)
88: 00000793 addi a5,zero,0
8c: 02813403 ld s0,40(sp)
90: 02013483 ld s1,32(sp)
94: 01813903 ld s2,24(sp)
98: 01013983 ld s3,16(sp)
9c: 00813a03 ld s4,8(sp)
a0: 03010113 addi sp,sp,48
a4: 00078513 addi a0,a5,0
a8: 00008067 jalr zero,0(ra)
With RVC:
0: 02000813 addi a6,zero,32
4: 7179 c.addi16sp sp,-48
6: f422 c.sdsp s0,40(sp)
8: f026 c.sdsp s1,32(sp)
a: ec4a c.sdsp s2,24(sp)
c: e84e c.sdsp s3,16(sp)
e: e452 c.sdsp s4,8(sp)
10: 1800 c.addi4spn s0,sp,48
12: 03056683 lwu a3,48(a0)
16: 1682 c.slli a3,0x20
18: 9281 c.srli a3,0x20
1a: 03456703 lwu a4,52(a0)
1e: 1702 c.slli a4,0x20
20: 9301 c.srli a4,0x20
22: 03856483 lwu s1,56(a0)
26: 1482 c.slli s1,0x20
28: 9081 c.srli s1,0x20
2a: 03c56903 lwu s2,60(a0)
2e: 1902 c.slli s2,0x20
30: 02095913 srli s2,s2,0x20
34: 04056983 lwu s3,64(a0)
38: 1982 c.slli s3,0x20
3a: 0209d993 srli s3,s3,0x20
3e: 09056a03 lwu s4,144(a0)
42: 1a02 c.slli s4,0x20
44: 020a5a13 srli s4,s4,0x20
48: 4325 c.li t1,9
4a: 006a7363 bgeu s4,t1,0x50
4e: 4a01 c.li s4,0
50: d914 c.sw a3,48(a0)
52: d958 c.sw a4,52(a0)
54: dd04 c.sw s1,56(a0)
56: 03252e23 sw s2,60(a0)
5a: 05352023 sw s3,64(a0)
5e: 4781 c.li a5,0
60: 7422 c.ldsp s0,40(sp)
62: 7482 c.ldsp s1,32(sp)
64: 6962 c.ldsp s2,24(sp)
66: 69c2 c.ldsp s3,16(sp)
68: 6a22 c.ldsp s4,8(sp)
6a: 6145 c.addi16sp sp,48
6c: 853e c.mv a0,a5
6e: 8082 c.jr ra
Signed-off-by: Luke Nelson <luke.r.nels@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Björn Töpel <bjorn.topel@gmail.com>
Link: https://lore.kernel.org/bpf/20200721025241.8077-4-luke.r.nels@gmail.com
|
|
The non-builtin route for offsetof has a dependency on size_t from
stdlib.h/stdint.h that is undeclared and may break targets.
The offsetof macro in bpf_helpers may disable the same macro in other
headers that have a #ifdef offsetof guard. Rather than add additional
dependencies improve the offsetof macro declared here to use the
builtin that is available since llvm 3.7 (the first with a BPF backend).
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200720061741.1514673-1-irogers@google.com
|
|
This patch adds functions for encoding and emitting compressed riscv
(RVC) instructions to the BPF JIT.
Some regular riscv instructions can be compressed into an RVC instruction
if the instruction fields meet some requirements. For example, "add rd,
rs1, rs2" can be compressed into "c.add rd, rs2" when rd == rs1.
To make using RVC encodings simpler, this patch also adds helper
functions that selectively emit either a regular instruction or a
compressed instruction if possible.
For example, emit_add will produce a "c.add" if possible and regular
"add" otherwise.
Signed-off-by: Luke Nelson <luke.r.nels@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200721025241.8077-3-luke.r.nels@gmail.com
|
|
Now that we have bpf_skip() for emitting nops, use it in
bpf_jit_prologue() in order to reduce code duplication.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200717165326.6786-6-iii@linux.ibm.com
|
|
This patch makes the necessary changes to struct rv_jit_context and to
bpf_int_jit_compile to support compressed riscv (RVC) instructions in
the BPF JIT.
It changes the JIT image to be u16 instead of u32, since RVC instructions
are 2 bytes as opposed to 4.
It also changes ctx->offset and ctx->ninsns to refer to 2-byte
instructions rather than 4-byte ones. The riscv PC is required to be
16-bit aligned with or without RVC, so this is sufficient to refer to
any valid riscv offset.
The code for computing jump offsets in bytes is updated accordingly,
and factored into a new "ninsns_rvoff" function to simplify the code.
Signed-off-by: Luke Nelson <luke.r.nels@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200721025241.8077-2-luke.r.nels@gmail.com
|
|
"BPF_MAXINSNS: Maximum possible literals" unnecessarily falls back to
the interpreter because of failing sanity check in bpf_set_addr. The
problem is that there are a lot of branches that can be shrunk, and
doing so opens up the possibility to shrink even more. This process
does not converge after 3 passes, causing code offsets to change during
the codegen pass, which must never happen.
Fix by inserting nops during codegen pass in order to preserve code
offets.
Fixes: 4e9b4a6883dd ("s390/bpf: Use relative long branches")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200717165326.6786-5-iii@linux.ibm.com
|
|
"BPF_MAXINSNS: Maximum possible literals" test causes panic with
bpf_jit_harden = 2. The reason is that BPF_JMP | BPF_EXIT is always
emitted as brc, however, after removal of JITed image size
limitations, brcl might be required.
Fix by using brcl when necessary.
Fixes: 4e9b4a6883dd ("s390/bpf: Use relative long branches")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200717165326.6786-4-iii@linux.ibm.com
|
|
Both signed and unsigned variants of BPF_JMP | BPF_K require
sign-extending the immediate. JIT emits cgfi for the signed case,
which is correct, and clgfi for the unsigned case, which is not
correct: clgfi zero-extends the immediate.
s390 does not provide an instruction that does sign-extension and
unsigned comparison at the same time. Therefore, fix by first loading
the sign-extended immediate into work register REG_1 and proceeding
as if it's BPF_X.
Fixes: 4e9b4a6883dd ("s390/bpf: Use relative long branches")
Reported-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Seth Forshee <seth.forshee@canonical.com>
Link: https://lore.kernel.org/bpf/20200717165326.6786-3-iii@linux.ibm.com
|
|
When running out of srctree, relative path to lib/test_bpf.ko is
different than when running in srctree. Check $building_out_of_srctree
environment variable and use a different relative path if needed.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200717165326.6786-2-iii@linux.ibm.com
|
|
It is possible to cause a btrfs mount to fail by racing it with a slow
umount. The crux of the sequence is generic_shutdown_super not yet
calling sop->put_super before btrfs_mount_root calls btrfs_open_devices.
If that occurs, btrfs_open_devices will decide the opened counter is
non-zero, increment it, and skip resetting fs_devices->total_rw_bytes to
0. From here, mount will call sget which will result in grab_super
trying to take the super block umount semaphore. That semaphore will be
held by the slow umount, so mount will block. Before up-ing the
semaphore, umount will delete the super block, resulting in mount's sget
reliably allocating a new one, which causes the mount path to dutifully
fill it out, and increment total_rw_bytes a second time, which causes
the mount to fail, as we see double the expected bytes.
Here is the sequence laid out in greater detail:
CPU0 CPU1
down_write sb->s_umount
btrfs_kill_super
kill_anon_super(sb)
generic_shutdown_super(sb);
shrink_dcache_for_umount(sb);
sync_filesystem(sb);
evict_inodes(sb); // SLOW
btrfs_mount_root
btrfs_scan_one_device
fs_devices = device->fs_devices
fs_info->fs_devices = fs_devices
// fs_devices-opened makes this a no-op
btrfs_open_devices(fs_devices, mode, fs_type)
s = sget(fs_type, test, set, flags, fs_info);
find sb in s_instances
grab_super(sb);
down_write(&s->s_umount); // blocks
sop->put_super(sb)
// sb->fs_devices->opened == 2; no-op
spin_lock(&sb_lock);
hlist_del_init(&sb->s_instances);
spin_unlock(&sb_lock);
up_write(&sb->s_umount);
return 0;
retry lookup
don't find sb in s_instances (deleted by CPU0)
s = alloc_super
return s;
btrfs_fill_super(s, fs_devices, data)
open_ctree // fs_devices total_rw_bytes improperly set!
btrfs_read_chunk_tree
read_one_dev // increment total_rw_bytes again!!
super_total_bytes < fs_devices->total_rw_bytes // ERROR!!!
To fix this, we clear total_rw_bytes from within btrfs_read_chunk_tree
before the calls to read_one_dev, while holding the sb umount semaphore
and the uuid mutex.
To reproduce, it is sufficient to dirty a decent number of inodes, then
quickly umount and mount.
for i in $(seq 0 500)
do
dd if=/dev/zero of="/mnt/foo/$i" bs=1M count=1
done
umount /mnt/foo&
mount /mnt/foo
does the trick for me.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When locking pages for delalloc, we check if it's dirty and mapping still
matches. If it does not match, we need to return -EAGAIN and release all
pages. Only the current page was put though, iterate over all the
remaining pages too.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
When running tests like generic/013 on test device with btrfs quota
enabled, it can normally lead to data leak, detected at unmount time:
BTRFS warning (device dm-3): qgroup 0/5 has unreleased space, type 0 rsv 4096
------------[ cut here ]------------
WARNING: CPU: 11 PID: 16386 at fs/btrfs/disk-io.c:4142 close_ctree+0x1dc/0x323 [btrfs]
RIP: 0010:close_ctree+0x1dc/0x323 [btrfs]
Call Trace:
btrfs_put_super+0x15/0x17 [btrfs]
generic_shutdown_super+0x72/0x110
kill_anon_super+0x18/0x30
btrfs_kill_super+0x17/0x30 [btrfs]
deactivate_locked_super+0x3b/0xa0
deactivate_super+0x40/0x50
cleanup_mnt+0x135/0x190
__cleanup_mnt+0x12/0x20
task_work_run+0x64/0xb0
__prepare_exit_to_usermode+0x1bc/0x1c0
__syscall_return_slowpath+0x47/0x230
do_syscall_64+0x64/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
---[ end trace caf08beafeca2392 ]---
BTRFS error (device dm-3): qgroup reserved space leaked
[CAUSE]
In the offending case, the offending operations are:
2/6: writev f2X[269 1 0 0 0 0] [1006997,67,288] 0
2/7: truncate f2X[269 1 0 0 48 1026293] 18388 0
The following sequence of events could happen after the writev():
CPU1 (writeback) | CPU2 (truncate)
-----------------------------------------------------------------
btrfs_writepages() |
|- extent_write_cache_pages() |
|- Got page for 1003520 |
| 1003520 is Dirty, no writeback |
| So (!clear_page_dirty_for_io()) |
| gets called for it |
|- Now page 1003520 is Clean. |
| | btrfs_setattr()
| | |- btrfs_setsize()
| | |- truncate_setsize()
| | New i_size is 18388
|- __extent_writepage() |
| |- page_offset() > i_size |
|- btrfs_invalidatepage() |
|- Page is clean, so no qgroup |
callback executed
This means, the qgroup reserved data space is not properly released in
btrfs_invalidatepage() as the page is Clean.
[FIX]
Instead of checking the dirty bit of a page, call
btrfs_qgroup_free_data() unconditionally in btrfs_invalidatepage().
As qgroup rsv are completely bound to the QGROUP_RESERVED bit of
io_tree, not bound to page status, thus we won't cause double freeing
anyway.
Fixes: 0b34c261e235 ("btrfs: qgroup: Prevent qgroup->reserved from going subzero")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
NULL dereference occurs when string that is not ended with space or
newline is written to some dpm sysfs interface (for example pp_dpm_sclk).
This happens because strsep replaces the tmp with NULL if the delimiter
is not present in string, which is then dereferenced by tmp[0].
Reproduction example:
sudo sh -c 'echo -n 1 > /sys/class/drm/card0/device/pp_dpm_sclk'
Signed-off-by: Paweł Gronowski <me@woland.xyz>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
|
|
Avoid kernel crash when vddci_control is SMU7_VOLTAGE_CONTROL_NONE and
vddci_voltage_table is empty. It has been tested on Intel Hades Canyon
(i7-8809G).
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=208489
Fixes: ac7822b0026f ("drm/amd/powerplay: add smumgr support for VEGAM (v2)")
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Qiu Wenbo <qiuwenbo@phytium.com.cn>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
|