summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-10-29net: phylink: simplify phylink_parse_fixedlink()Russell King (Oracle)
phylink_parse_fixedlink() wants to preserve the pause, asym_pause and autoneg bits in pl->supported. Rather than reading the bits into separate bools, zeroing pl->supported, and then setting them if they were previously set, use a mask and linkmode_and() to achieve the same result. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/E1t3Fh5-000aQi-Nk@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-29net: usb: qmi_wwan: add Quectel RG650VBenoît Monin
Add support for Quectel RG650V which is based on Qualcomm SDX65 chip. The composition is DIAG / NMEA / AT / AT / QMI. T: Bus=02 Lev=01 Prnt=01 Port=03 Cnt=01 Dev#= 4 Spd=5000 MxCh= 0 D: Ver= 3.20 Cls=00(>ifc ) Sub=00 Prot=00 MxPS= 9 #Cfgs= 1 P: Vendor=2c7c ProdID=0122 Rev=05.15 S: Manufacturer=Quectel S: Product=RG650V-EU S: SerialNumber=xxxxxxx C: #Ifs= 5 Cfg#= 1 Atr=a0 MxPwr=896mA I: If#= 0 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=30 Driver=option E: Ad=01(O) Atr=02(Bulk) MxPS=1024 Ivl=0ms E: Ad=81(I) Atr=02(Bulk) MxPS=1024 Ivl=0ms I: If#= 1 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=00 Prot=00 Driver=option E: Ad=02(O) Atr=02(Bulk) MxPS=1024 Ivl=0ms E: Ad=82(I) Atr=02(Bulk) MxPS=1024 Ivl=0ms I: If#= 2 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=00 Prot=00 Driver=option E: Ad=03(O) Atr=02(Bulk) MxPS=1024 Ivl=0ms E: Ad=83(I) Atr=02(Bulk) MxPS=1024 Ivl=0ms E: Ad=84(I) Atr=03(Int.) MxPS= 10 Ivl=9ms I: If#= 3 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=00 Prot=00 Driver=option E: Ad=04(O) Atr=02(Bulk) MxPS=1024 Ivl=0ms E: Ad=85(I) Atr=02(Bulk) MxPS=1024 Ivl=0ms E: Ad=86(I) Atr=03(Int.) MxPS= 10 Ivl=9ms I: If#= 4 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=ff Driver=qmi_wwan E: Ad=05(O) Atr=02(Bulk) MxPS=1024 Ivl=0ms E: Ad=87(I) Atr=02(Bulk) MxPS=1024 Ivl=0ms E: Ad=88(I) Atr=03(Int.) MxPS= 8 Ivl=9ms Signed-off-by: Benoît Monin <benoit.monin@gmx.fr> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241024151113.53203-1-benoit.monin@gmx.fr Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-29Merge branch 'mlx5e-update-features-on-config-changes'Jakub Kicinski
Tariq Toukan says: ==================== mlx5e update features on config changes This small patchset by Dragos adds a call to netdev_update_features() in configuration changes that could impact the features status. ==================== Link: https://patch.msgid.link/20241024164134.299646-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-29net/mlx5e: Update features on ring size changeDragos Tatulea
When the ring size changes successfully, trigger netdev_update_features() to enable features in wanted state if applicable. An example of such scenario: $ ip link set dev eth1 up $ ethtool --set-ring eth1 rx 8192 $ ip link set dev eth1 mtu 9000 $ ethtool --features eth1 rx-gro-hw on --> fails $ ethtool --set-ring eth1 rx 1024 With this patch, HW GRO will be turned on automatically because it is set in the device's wanted_features. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241024164134.299646-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-29net/mlx5e: Update features on MTU changeDragos Tatulea
When the MTU changes successfully, trigger netdev_update_features() to enable features in wanted state if applicable. An example of such scenario: $ ip link set dev eth1 up $ ethtool --set-ring eth1 rx 8192 $ ip link set dev eth1 mtu 9000 $ ethtool --features eth1 rx-gro-hw on --> fails $ ip link set dev eth1 mtu 7000 With this patch, HW GRO will be turned on automatically because it is set in the device's wanted_features. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241024164134.299646-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-29wwan: core: Pass string literal as format argument of dev_set_name()Simon Horman
Both gcc-14 and clang-18 report that passing a non-string literal as the format argument of dev_set_name() is potentially insecure. E.g. clang-18 says: drivers/net/wwan/wwan_core.c:442:34: warning: format string is not a string literal (potentially insecure) [-Wformat-security] 442 | return dev_set_name(&port->dev, buf); | ^~~ drivers/net/wwan/wwan_core.c:442:34: note: treat the string as an argument to avoid this 442 | return dev_set_name(&port->dev, buf); | ^ | "%s", It is always the case where the contents of mod is safe to pass as the format argument. That is, in my understanding, it never contains any format escape sequences. But, it seems better to be safe than sorry. And, as a bonus, compiler output becomes less verbose by addressing this issue as suggested by clang-18. Compile tested only. No functional change intended. Signed-off-by: Simon Horman <horms@kernel.org> Acked-by: Sergey Ryazanov <ryazanov.s.a@gmail.com> Link: https://patch.msgid.link/20241023-wwan-fmt-v1-1-521b39968639@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-29net/sched: sch_api: fix xa_insert() error path in tcf_block_get_ext()Vladimir Oltean
This command: $ tc qdisc replace dev eth0 ingress_block 1 egress_block 1 clsact Error: block dev insert failed: -EBUSY. fails because user space requests the same block index to be set for both ingress and egress. [ side note, I don't think it even failed prior to commit 913b47d3424e ("net/sched: Introduce tc block netdev tracking infra"), because this is a command from an old set of notes of mine which used to work, but alas, I did not scientifically bisect this ] The problem is not that it fails, but rather, that the second time around, it fails differently (and irrecoverably): $ tc qdisc replace dev eth0 ingress_block 1 egress_block 1 clsact Error: dsa_core: Flow block cb is busy. [ another note: the extack is added by me for illustration purposes. the context of the problem is that clsact_init() obtains the same &q->ingress_block pointer as &q->egress_block, and since we call tcf_block_get_ext() on both of them, "dev" will be added to the block->ports xarray twice, thus failing the operation: once through the ingress block pointer, and once again through the egress block pointer. the problem itself is that when xa_insert() fails, we have emitted a FLOW_BLOCK_BIND command through ndo_setup_tc(), but the offload never sees a corresponding FLOW_BLOCK_UNBIND. ] Even correcting the bad user input, we still cannot recover: $ tc qdisc replace dev swp3 ingress_block 1 egress_block 2 clsact Error: dsa_core: Flow block cb is busy. Basically the only way to recover is to reboot the system, or unbind and rebind the net device driver. To fix the bug, we need to fill the correct error teardown path which was missed during code movement, and call tcf_block_offload_unbind() when xa_insert() fails. [ last note, fundamentally I blame the label naming convention in tcf_block_get_ext() for the bug. The labels should be named after what they do, not after the error path that jumps to them. This way, it is obviously wrong that two labels pointing to the same code mean something is wrong, and checking the code correctness at the goto site is also easier ] Fixes: 94e2557d086a ("net: sched: move block device tracking into tcf_block_get/put_ext()") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20241023100541.974362-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-29selftests: tc-testing: Fix typo errorKaran Sanghavi
Correct the typo errors in json files - "diffferent" is corrected to "different". - "muliple" and "miltiple" is corrected to "multiple". Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Karan Sanghavi <karansanghvi98@gmail.com> Link: https://patch.msgid.link/20241022-multiple_spell_error-v2-1-7e5036506fe5@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-29netdevsim: Add trailing zero to terminate the string in ↵Zichen Xie
nsim_nexthop_bucket_activity_write() This was found by a static analyzer. We should not forget the trailing zero after copy_from_user() if we will further do some string operations, sscanf() in this case. Adding a trailing zero will ensure that the function performs properly. Fixes: c6385c0b67c5 ("netdevsim: Allow reporting activity on nexthop buckets") Signed-off-by: Zichen Xie <zichenxie0106@gmail.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20241022171907.8606-1-zichenxie0106@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-29selftests/bpf: Test with a very short loopEduard Zingerman
The test added is a simplified reproducer from syzbot report [1]. If verifier does not insert checkpoint somewhere inside the loop, verification of the program would take a very long time. This would happen because mark_chain_precision() for register r7 would constantly trace jump history of the loop back, processing many iterations for each mark_chain_precision() call. [1] https://lore.kernel.org/bpf/670429f6.050a0220.49194.0517.GAE@google.com/ Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20241029172641.1042523-2-eddyz87@gmail.com
2024-10-29bpf: Force checkpoint when jmp history is too longEduard Zingerman
A specifically crafted program might trick verifier into growing very long jump history within a single bpf_verifier_state instance. Very long jump history makes mark_chain_precision() unreasonably slow, especially in case if verifier processes a loop. Mitigate this by forcing new state in is_state_visited() in case if current state's jump history is too long. Use same constant as in `skip_inf_loop_check`, but multiply it by arbitrarily chosen value 2 to account for jump history containing not only information about jumps, but also information about stack access. For an example of problematic program consider the code below, w/o this patch the example is processed by verifier for ~15 minutes, before failing to allocate big-enough chunk for jmp_history. 0: r7 = *(u16 *)(r1 +0);" 1: r7 += 0x1ab064b9;" 2: if r7 & 0x702000 goto 1b; 3: r7 &= 0x1ee60e;" 4: r7 += r1;" 5: if r7 s> 0x37d2 goto +0;" 6: r0 = 0;" 7: exit;" Perf profiling shows that most of the time is spent in mark_chain_precision() ~95%. The easiest way to explain why this program causes problems is to apply the following patch: diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 0c216e71cec7..4b4823961abe 100644 \--- a/include/linux/bpf.h \+++ b/include/linux/bpf.h \@@ -1926,7 +1926,7 @@ struct bpf_array { }; }; -#define BPF_COMPLEXITY_LIMIT_INSNS 1000000 /* yes. 1M insns */ +#define BPF_COMPLEXITY_LIMIT_INSNS 256 /* yes. 1M insns */ #define MAX_TAIL_CALL_CNT 33 /* Maximum number of loops for bpf_loop and bpf_iter_num. diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index f514247ba8ba..75e88be3bb3e 100644 \--- a/kernel/bpf/verifier.c \+++ b/kernel/bpf/verifier.c \@@ -18024,8 +18024,13 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx) skip_inf_loop_check: if (!force_new_state && env->jmps_processed - env->prev_jmps_processed < 20 && - env->insn_processed - env->prev_insn_processed < 100) + env->insn_processed - env->prev_insn_processed < 100) { + verbose(env, "is_state_visited: suppressing checkpoint at %d, %d jmps processed, cur->jmp_history_cnt is %d\n", + env->insn_idx, + env->jmps_processed - env->prev_jmps_processed, + cur->jmp_history_cnt); add_new_state = false; + } goto miss; } /* If sl->state is a part of a loop and this loop's entry is a part of \@@ -18142,6 +18147,9 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx) if (!add_new_state) return 0; + verbose(env, "is_state_visited: new checkpoint at %d, resetting env->jmps_processed\n", + env->insn_idx); + /* There were no equivalent states, remember the current one. * Technically the current state is not proven to be safe yet, * but it will either reach outer most bpf_exit (which means it's safe) And observe verification log: ... is_state_visited: new checkpoint at 5, resetting env->jmps_processed 5: R1=ctx() R7=ctx(...) 5: (65) if r7 s> 0x37d2 goto pc+0 ; R7=ctx(...) 6: (b7) r0 = 0 ; R0_w=0 7: (95) exit from 5 to 6: R1=ctx() R7=ctx(...) R10=fp0 6: R1=ctx() R7=ctx(...) R10=fp0 6: (b7) r0 = 0 ; R0_w=0 7: (95) exit is_state_visited: suppressing checkpoint at 1, 3 jmps processed, cur->jmp_history_cnt is 74 from 2 to 1: R1=ctx() R7_w=scalar(...) R10=fp0 1: R1=ctx() R7_w=scalar(...) R10=fp0 1: (07) r7 += 447767737 is_state_visited: suppressing checkpoint at 2, 3 jmps processed, cur->jmp_history_cnt is 75 2: R7_w=scalar(...) 2: (45) if r7 & 0x702000 goto pc-2 ... mark_precise 152 steps for r7 ... 2: R7_w=scalar(...) is_state_visited: suppressing checkpoint at 1, 4 jmps processed, cur->jmp_history_cnt is 75 1: (07) r7 += 447767737 is_state_visited: suppressing checkpoint at 2, 4 jmps processed, cur->jmp_history_cnt is 76 2: R7_w=scalar(...) 2: (45) if r7 & 0x702000 goto pc-2 ... BPF program is too large. Processed 257 insn The log output shows that checkpoint at label (1) is never created, because it is suppressed by `skip_inf_loop_check` logic: a. When 'if' at (2) is processed it pushes a state with insn_idx (1) onto stack and proceeds to (3); b. At (5) checkpoint is created, and this resets env->{jmps,insns}_processed. c. Verification proceeds and reaches `exit`; d. State saved at step (a) is popped from stack and is_state_visited() considers if checkpoint needs to be added, but because env->{jmps,insns}_processed had been just reset at step (b) the `skip_inf_loop_check` logic forces `add_new_state` to false. e. Verifier proceeds with current state, which slowly accumulates more and more entries in the jump history. The accumulation of entries in the jump history is a problem because of two factors: - it eventually exhausts memory available for kmalloc() allocation; - mark_chain_precision() traverses the jump history of a state, meaning that if `r7` is marked precise, verifier would iterate ever growing jump history until parent state boundary is reached. (note: the log also shows a REG INVARIANTS VIOLATION warning upon jset processing, but that's another bug to fix). With this patch applied, the example above is rejected by verifier under 1s of time, reaching 1M instructions limit. The program is a simplified reproducer from syzbot report. Previous discussion could be found at [1]. The patch does not cause any changes in verification performance, when tested on selftests from veristat.cfg and cilium programs taken from [2]. [1] https://lore.kernel.org/bpf/20241009021254.2805446-1-eddyz87@gmail.com/ [2] https://github.com/anakryiko/cilium Changelog: - v1 -> v2: - moved patch to bpf tree; - moved force_new_state variable initialization after declaration and shortened the comment. v1: https://lore.kernel.org/bpf/20241018020307.1766906-1-eddyz87@gmail.com/ Fixes: 2589726d12a1 ("bpf: introduce bounded loops") Reported-by: syzbot+7e46cdef14bf496a3ab4@syzkaller.appspotmail.com Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20241029172641.1042523-1-eddyz87@gmail.com Closes: https://lore.kernel.org/bpf/670429f6.050a0220.49194.0517.GAE@google.com/
2024-10-29rtnetlink: Fix kdoc of rtnl_af_register().Kuniyuki Iwashima
Commit 26eebdc4b005 ("rtnetlink: Return int from rtnl_af_register().") made rtnl_af_register() return int again, and kdoc needs to be fixed up. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241022210320.86111-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-29net/sched: stop qdisc_tree_reduce_backlog on TC_H_ROOTPedro Tammela
In qdisc_tree_reduce_backlog, Qdiscs with major handle ffff: are assumed to be either root or ingress. This assumption is bogus since it's valid to create egress qdiscs with major handle ffff: Budimir Markovic found that for qdiscs like DRR that maintain an active class list, it will cause a UAF with a dangling class pointer. In 066a3b5b2346, the concern was to avoid iterating over the ingress qdisc since its parent is itself. The proper fix is to stop when parent TC_H_ROOT is reached because the only way to retrieve ingress is when a hierarchy which does not contain a ffff: major handle call into qdisc_lookup with TC_H_MAJ(TC_H_ROOT). In the scenario where major ffff: is an egress qdisc in any of the tree levels, the updates will also propagate to TC_H_ROOT, which then the iteration must stop. Fixes: 066a3b5b2346 ("[NET_SCHED] sch_api: fix qdisc_tree_decrease_qlen() loop") Reported-by: Budimir Markovic <markovicbudimir@gmail.com> Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> net/sched/sch_api.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241024165547.418570-1-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-29selftests: netfilter: nft_flowtable.sh: make first pass deterministicFlorian Westphal
The CI occasionaly encounters a failing test run. Example: # PASS: ipsec tunnel mode for ns1/ns2 # re-run with random mtus: -o 10966 -l 19499 -r 31322 # PASS: flow offloaded for ns1/ns2 [..] # FAIL: ipsec tunnel ... counter 1157059 exceeds expected value 878489 This script will re-exec itself, on the second run, random MTUs are chosen for the involved links. This is done so we can cover different combinations (large mtu on client, small on server, link has lowest mtu, etc). Furthermore, file size is random, even for the first run. Rework this script and always use the same file size on initial run so that at least the first round can be expected to have reproducible behavior. Second round will use random mtu/filesize. Raise the failure limit to that of the file size, this should avoid all errneous test errors. Currently, first fin will remove the offload, so if one peer is already closing remaining data is handled by classic path, which result in larger-than-expected counter and a test failure. Given packet path also counts tcp/ip headers, in case offload is completely broken this test will still fail (as expected). The test counter limit could be made more strict again in the future once flowtable can keep a connection in offloaded state until FINs in both directions were seen. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241022152324.13554-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-29gtp: allow -1 to be specified as file description from userspacePablo Neira Ayuso
Existing user space applications maintained by the Osmocom project are breaking since a recent fix that addresses incorrect error checking. Restore operation for user space programs that specify -1 as file descriptor to skip GTPv0 or GTPv1 only sockets. Fixes: defd8b3c37b0 ("gtp: fix a potential NULL pointer dereference") Reported-by: Pau Espin Pedrol <pespin@sysmocom.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Tested-by: Oliver Smith <osmith@sysmocom.de> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241022144825.66740-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-29mctp i2c: handle NULL header addressMatt Johnston
daddr can be NULL if there is no neighbour table entry present, in that case the tx packet should be dropped. saddr will usually be set by MCTP core, but check for NULL in case a packet is transmitted by a different protocol. Fixes: f5b8abf9fc3d ("mctp i2c: MCTP I2C binding driver") Cc: stable@vger.kernel.org Reported-by: Dung Cao <dung@os.amperecomputing.com> Signed-off-by: Matt Johnston <matt@codeconstruct.com.au> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241022-mctp-i2c-null-dest-v3-1-e929709956c5@codeconstruct.com.au Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-29Merge branch 'ipv4-prepare-core-ipv4-files-to-future-flowi4_tos-conversion'Jakub Kicinski
Guillaume Nault says: ==================== ipv4: Prepare core ipv4 files to future .flowi4_tos conversion. Continue preparing users of ->flowi4_tos (struct flowi4) to the future conversion of this field (from __u8 to dscp_t). The objective is to have type annotation to properly separate DSCP bits from ECN ones. This way we'll ensure that ECN doesn't interfere with DSCP and avoid regressions where it break routing descisions (fib rules in particular). This series concentrates on some easy IPv4 conversions where ->flowi4_tos is set directly from an IPv4 header, so we can get the DSCP value using the ip4h_dscp() helper function. ==================== Link: https://patch.msgid.link/cover.1729530028.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-29ipv4: Prepare ip_rt_get_source() to future .flowi4_tos conversion.Guillaume Nault
Use ip4h_dscp() to get the DSCP from the IPv4 header, then convert the dscp_t value to __u8 with inet_dscp_to_dsfield(). Then, when we'll convert .flowi4_tos to dscp_t, we'll just have to drop the inet_dscp_to_dsfield() call. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/0a13a200f31809841975e38633914af1061e0c04.1729530028.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-29ipv4: Prepare ipmr_rt_fib_lookup() to future .flowi4_tos conversion.Guillaume Nault
Use ip4h_dscp() to get the DSCP from the IPv4 header, then convert the dscp_t value to __u8 with inet_dscp_to_dsfield(). Then, when we'll convert .flowi4_tos to dscp_t, we'll just have to drop the inet_dscp_to_dsfield() call. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/462402a097260357a7aba80228612305f230b6a9.1729530028.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-29ipv4: Prepare icmp_reply() to future .flowi4_tos conversion.Guillaume Nault
Use ip4h_dscp() to get the DSCP from the IPv4 header, then convert the dscp_t value to __u8 with inet_dscp_to_dsfield(). Then, when we'll convert .flowi4_tos to dscp_t, we'll just have to drop the inet_dscp_to_dsfield() call. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/61b7563563f8b0a562b5b62032fe5260034d0aac.1729530028.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-29ipv4: Prepare fib_compute_spec_dst() to future .flowi4_tos conversion.Guillaume Nault
Use ip4h_dscp() to get the DSCP from the IPv4 header, then convert the dscp_t value to __u8 with inet_dscp_to_dsfield(). Then, when we'll convert .flowi4_tos to dscp_t, we'll just have to drop the inet_dscp_to_dsfield() call. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/a0eba69cce94f747e4c7516184a85ffd0abbe3f0.1729530028.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-29ipv4: ip_tunnel: Fix suspicious RCU usage warning in ip_tunnel_find()Ido Schimmel
The per-netns IP tunnel hash table is protected by the RTNL mutex and ip_tunnel_find() is only called from the control path where the mutex is taken. Add a lockdep expression to hlist_for_each_entry_rcu() in ip_tunnel_find() in order to validate that the mutex is held and to silence the suspicious RCU usage warning [1]. [1] WARNING: suspicious RCU usage 6.12.0-rc3-custom-gd95d9a31aceb #139 Not tainted ----------------------------- net/ipv4/ip_tunnel.c:221 RCU-list traversed in non-reader section!! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by ip/362: #0: ffffffff86fc7cb0 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x377/0xf60 stack backtrace: CPU: 12 UID: 0 PID: 362 Comm: ip Not tainted 6.12.0-rc3-custom-gd95d9a31aceb #139 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Call Trace: <TASK> dump_stack_lvl+0xba/0x110 lockdep_rcu_suspicious.cold+0x4f/0xd6 ip_tunnel_find+0x435/0x4d0 ip_tunnel_newlink+0x517/0x7a0 ipgre_newlink+0x14c/0x170 __rtnl_newlink+0x1173/0x19c0 rtnl_newlink+0x6c/0xa0 rtnetlink_rcv_msg+0x3cc/0xf60 netlink_rcv_skb+0x171/0x450 netlink_unicast+0x539/0x7f0 netlink_sendmsg+0x8c1/0xd80 ____sys_sendmsg+0x8f9/0xc20 ___sys_sendmsg+0x197/0x1e0 __sys_sendmsg+0x122/0x1f0 do_syscall_64+0xbb/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: c54419321455 ("GRE: Refactor GRE tunneling code.") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241023123009.749764-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-29ipv4: ip_tunnel: Fix suspicious RCU usage warning in ip_tunnel_init_flow()Ido Schimmel
There are code paths from which the function is called without holding the RCU read lock, resulting in a suspicious RCU usage warning [1]. Fix by using l3mdev_master_upper_ifindex_by_index() which will acquire the RCU read lock before calling l3mdev_master_upper_ifindex_by_index_rcu(). [1] WARNING: suspicious RCU usage 6.12.0-rc3-custom-gac8f72681cf2 #141 Not tainted ----------------------------- net/core/dev.c:876 RCU-list traversed in non-reader section!! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by ip/361: #0: ffffffff86fc7cb0 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x377/0xf60 stack backtrace: CPU: 3 UID: 0 PID: 361 Comm: ip Not tainted 6.12.0-rc3-custom-gac8f72681cf2 #141 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Call Trace: <TASK> dump_stack_lvl+0xba/0x110 lockdep_rcu_suspicious.cold+0x4f/0xd6 dev_get_by_index_rcu+0x1d3/0x210 l3mdev_master_upper_ifindex_by_index_rcu+0x2b/0xf0 ip_tunnel_bind_dev+0x72f/0xa00 ip_tunnel_newlink+0x368/0x7a0 ipgre_newlink+0x14c/0x170 __rtnl_newlink+0x1173/0x19c0 rtnl_newlink+0x6c/0xa0 rtnetlink_rcv_msg+0x3cc/0xf60 netlink_rcv_skb+0x171/0x450 netlink_unicast+0x539/0x7f0 netlink_sendmsg+0x8c1/0xd80 ____sys_sendmsg+0x8f9/0xc20 ___sys_sendmsg+0x197/0x1e0 __sys_sendmsg+0x122/0x1f0 do_syscall_64+0xbb/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: db53cd3d88dc ("net: Handle l3mdev in ip_tunnel_init_flow") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20241022063822.462057-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-29bpf: fix filed access without lockJiayuan Chen
The tcp_bpf_recvmsg_parser() function, running in user context, retrieves seq_copied from tcp_sk without holding the socket lock, and stores it in a local variable seq. However, the softirq context can modify tcp_sk->seq_copied concurrently, for example, n tcp_read_sock(). As a result, the seq value is stale when it is assigned back to tcp_sk->copied_seq at the end of tcp_bpf_recvmsg_parser(), leading to incorrect behavior. Due to concurrency, the copied_seq field in tcp_bpf_recvmsg_parser() might be set to an incorrect value (less than the actual copied_seq) at the end of function: 'WRITE_ONCE(tcp->copied_seq, seq)'. This causes the 'offset' to be negative in tcp_read_sock()->tcp_recv_skb() when processing new incoming packets (sk->copied_seq - skb->seq becomes less than 0), and all subsequent packets will be dropped. Signed-off-by: Jiayuan Chen <mrpre@163.com> Link: https://lore.kernel.org/r/20241028065226.35568-1-mrpre@163.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-10-29Merge branch 'ibm-emac-more-cleanups'Paolo Abeni
Rosen Penev says: ==================== ibm: emac: more cleanups Tested on Cisco MX60W. v2: fixed build errors. Also added extra commits to clean the driver up further. v3: Added tested message. Removed bad alloc_netdev_dummy commit. v4: removed modules changes from patchset. Added fix for if MAC not found. v5: added of_find_matching_node commit. v6: resend after net-next merge. v7: removed of_find_matching_node commit. Adjusted mutex_init patch. v8: removed patch removing custom init/exit. Needs more work. ==================== Link: https://patch.msgid.link/20241022002245.843242-1-rosenp@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29net: ibm: emac: generate random MAC if not foundRosen Penev
On this Cisco MX60W, u-boot sets the local-mac-address property. Unfortunately by default, the MAC is wrong and is actually located on a UBI partition. Which means nvmem needs to be used to grab it. In the case where that fails, EMAC fails to initialize instead of generating a random MAC as many other drivers do. Match behavior with other drivers to have a working ethernet interface. Signed-off-by: Rosen Penev <rosenp@gmail.com> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29net: ibm: emac: use devm for mutex_initRosen Penev
It seems since inception that mutex_destroy was never called for these in _remove. Instead of handling this manually, just use devm for simplicity. Signed-off-by: Rosen Penev <rosenp@gmail.com> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29net: ibm: emac: use platform_get_irqRosen Penev
No need for irq_of_parse_and_map since we have platform_device. Signed-off-by: Rosen Penev <rosenp@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29net: ibm: emac: use devm_platform_ioremap_resourceRosen Penev
No need to have a struct resource. Gets rid of the TODO. Signed-off-by: Rosen Penev <rosenp@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29net: ibm: emac: use netif_receive_skb_listRosen Penev
Small rx improvement. Would use napi_gro_receive instead but that's a lot more involved than netif_receive_skb_list because of how the function is implemented. Before: > iperf -c 192.168.1.1 ------------------------------------------------------------ Client connecting to 192.168.1.1, TCP port 5001 TCP window size: 16.0 KByte (default) ------------------------------------------------------------ [ 1] local 192.168.1.101 port 51556 connected with 192.168.1.1 port 5001 [ ID] Interval Transfer Bandwidth [ 1] 0.00-10.04 sec 559 MBytes 467 Mbits/sec > iperf -c 192.168.1.1 ------------------------------------------------------------ Client connecting to 192.168.1.1, TCP port 5001 TCP window size: 16.0 KByte (default) ------------------------------------------------------------ [ 1] local 192.168.1.101 port 48228 connected with 192.168.1.1 port 5001 [ ID] Interval Transfer Bandwidth [ 1] 0.00-10.03 sec 558 MBytes 467 Mbits/sec > iperf -c 192.168.1.1 ------------------------------------------------------------ Client connecting to 192.168.1.1, TCP port 5001 TCP window size: 16.0 KByte (default) ------------------------------------------------------------ [ 1] local 192.168.1.101 port 47600 connected with 192.168.1.1 port 5001 [ ID] Interval Transfer Bandwidth [ 1] 0.00-10.04 sec 557 MBytes 466 Mbits/sec > iperf -c 192.168.1.1 ------------------------------------------------------------ Client connecting to 192.168.1.1, TCP port 5001 TCP window size: 16.0 KByte (default) ------------------------------------------------------------ [ 1] local 192.168.1.101 port 37252 connected with 192.168.1.1 port 5001 [ ID] Interval Transfer Bandwidth [ 1] 0.00-10.05 sec 559 MBytes 467 Mbits/sec After: > iperf -c 192.168.1.1 ------------------------------------------------------------ Client connecting to 192.168.1.1, TCP port 5001 TCP window size: 16.0 KByte (default) ------------------------------------------------------------ [ 1] local 192.168.1.101 port 40786 connected with 192.168.1.1 port 5001 [ ID] Interval Transfer Bandwidth [ 1] 0.00-10.05 sec 572 MBytes 478 Mbits/sec > iperf -c 192.168.1.1 ------------------------------------------------------------ Client connecting to 192.168.1.1, TCP port 5001 TCP window size: 16.0 KByte (default) ------------------------------------------------------------ [ 1] local 192.168.1.101 port 52482 connected with 192.168.1.1 port 5001 [ ID] Interval Transfer Bandwidth [ 1] 0.00-10.04 sec 571 MBytes 477 Mbits/sec > iperf -c 192.168.1.1 ------------------------------------------------------------ Client connecting to 192.168.1.1, TCP port 5001 TCP window size: 16.0 KByte (default) ------------------------------------------------------------ [ 1] local 192.168.1.101 port 48370 connected with 192.168.1.1 port 5001 [ ID] Interval Transfer Bandwidth [ 1] 0.00-10.04 sec 572 MBytes 478 Mbits/sec > iperf -c 192.168.1.1 ------------------------------------------------------------ Client connecting to 192.168.1.1, TCP port 5001 TCP window size: 16.0 KByte (default) ------------------------------------------------------------ [ 1] local 192.168.1.101 port 46086 connected with 192.168.1.1 port 5001 [ ID] Interval Transfer Bandwidth [ 1] 0.00-10.05 sec 571 MBytes 476 Mbits/sec > iperf -c 192.168.1.1 ------------------------------------------------------------ Client connecting to 192.168.1.1, TCP port 5001 TCP window size: 16.0 KByte (default) ------------------------------------------------------------ [ 1] local 192.168.1.101 port 46062 connected with 192.168.1.1 port 5001 [ ID] Interval Transfer Bandwidth [ 1] 0.00-10.04 sec 572 MBytes 478 Mbits/sec Signed-off-by: Rosen Penev <rosenp@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29Merge branch 'intel-wired-lan-driver-fixes-2024-10-21-igb-ice'Paolo Abeni
Jacob Keller says: ==================== Intel Wired LAN Driver Fixes 2024-10-21 (igb, ice) This series includes fixes for the ice and igb drivers. Wander fixes an issue in igb when operating on PREEMPT_RT kernels due to the PREEMPT_RT kernel switching IRQs to be threaded by default. Michal fixes the ice driver to block subfunction port creation when the PF is operating in legacy (non-switchdev) mode. Arkadiusz fixes a crash when loading the ice driver on an E810 LOM which has DPLL enabled. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> ==================== Link: https://patch.msgid.link/20241021-iwl-2024-10-21-iwl-net-fixes-v1-0-a50cb3059f55@intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29ice: fix crash on probe for DPLL enabled E810 LOMArkadiusz Kubalewski
The E810 Lan On Motherboard (LOM) design is vendor specific. Intel provides the reference design, but it is up to vendor on the final product design. For some cases, like Linux DPLL support, the static values defined in the driver does not reflect the actual LOM design. Current implementation of dpll pins is causing the crash on probe of the ice driver for such DPLL enabled E810 LOM designs: WARNING: (...) at drivers/dpll/dpll_core.c:495 dpll_pin_get+0x2c4/0x330 ... Call Trace: <TASK> ? __warn+0x83/0x130 ? dpll_pin_get+0x2c4/0x330 ? report_bug+0x1b7/0x1d0 ? handle_bug+0x42/0x70 ? exc_invalid_op+0x18/0x70 ? asm_exc_invalid_op+0x1a/0x20 ? dpll_pin_get+0x117/0x330 ? dpll_pin_get+0x2c4/0x330 ? dpll_pin_get+0x117/0x330 ice_dpll_get_pins.isra.0+0x52/0xe0 [ice] ... The number of dpll pins enabled by LOM vendor is greater than expected and defined in the driver for Intel designed NICs, which causes the crash. Prevent the crash and allow generic pin initialization within Linux DPLL subsystem for DPLL enabled E810 LOM designs. Newly designed solution for described issue will be based on "per HW design" pin initialization. It requires pin information dynamically acquired from the firmware and is already in progress, planned for next-tree only. Fixes: d7999f5ea64b ("ice: implement dpll interface to control cgu") Reviewed-by: Karol Kolacinski <karol.kolacinski@intel.com> Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29ice: block SF port creation in legacy modeMichal Swiatkowski
There is no support for SF in legacy mode. Reflect it in the code. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Fixes: eda69d654c7e ("ice: add basic devlink subfunctions support") Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29igb: Disable threaded IRQ for igb_msix_otherWander Lairson Costa
During testing of SR-IOV, Red Hat QE encountered an issue where the ip link up command intermittently fails for the igbvf interfaces when using the PREEMPT_RT variant. Investigation revealed that e1000_write_posted_mbx returns an error due to the lack of an ACK from e1000_poll_for_ack. The underlying issue arises from the fact that IRQs are threaded by default under PREEMPT_RT. While the exact hardware details are not available, it appears that the IRQ handled by igb_msix_other must be processed before e1000_poll_for_ack times out. However, e1000_write_posted_mbx is called with preemption disabled, leading to a scenario where the IRQ is serviced only after the failure of e1000_write_posted_mbx. To resolve this, we set IRQF_NO_THREAD for the affected interrupt, ensuring that the kernel handles it immediately, thereby preventing the aforementioned error. Reproducer: #!/bin/bash # echo 2 > /sys/class/net/ens14f0/device/sriov_numvfs ipaddr_vlan=3 nic_test=ens14f0 vf=${nic_test}v0 while true; do ip link set ${nic_test} mtu 1500 ip link set ${vf} mtu 1500 ip link set $vf up ip link set ${nic_test} vf 0 vlan ${ipaddr_vlan} ip addr add 172.30.${ipaddr_vlan}.1/24 dev ${vf} ip addr add 2021:db8:${ipaddr_vlan}::1/64 dev ${vf} if ! ip link show $vf | grep 'state UP'; then echo 'Error found' break fi ip link set $vf down done Signed-off-by: Wander Lairson Costa <wander@redhat.com> Fixes: 9d5c824399de ("igb: PCI-Express 82575 Gigabit Ethernet driver") Reported-by: Yuying Ma <yuma@redhat.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29ASoC: codecs: wcd937x: relax the AUX PDM watchdogAlexey Klimov
On a system with wcd937x, rxmacro and Qualcomm audio DSP, which is pretty common set of devices on Qualcomm platforms, and due to the order of how DAPM widgets are powered on (they are sorted), there is a small time window when wcd937x chip is online and expects the flow of incoming data but rxmacro is not yet online. When wcd937x is programmed to receive data via AUX port then its AUX PDM watchdog is enabled in wcd937x_codec_enable_aux_pa(). If due to some reasons the rxmacro and soundwire machinery are delayed to start streaming data, then there is a chance for this AUX PDM watchdog to reset the wcd937x codec. Such event is not logged as a message and only wcd937x IRQ counter is increased however there could be a lot of other reasons for that IRQ. There is a similar opportunity for such delay during DAPM widgets power down sequence. If wcd937x codec reset happens on the start of the playback, then there will be no sound and if such reset happens at the end of a playback then it may generate additional clicks and pops noises. On qrb4210 RB2 board without any debugging bits the wcd937x resets are sometimes observed at the end of a playback though not always. With some debugging messages or with some tracing enabled the AUX PDM watchdog resets the wcd937x codec at the start of a playback and there is no sound output at all. In this patch: - TIMEOUT_SEL bit in PDM_WD_CTL2 register is set to increase the watchdog reset delay to 100ms which eliminates the AUX PDM watchdog IRQs on qrb4210 RB2 board completely and decreases the number of unwanted clicks noises; - HOLD_OFF bit postpones triggering such watchdog IRQ till wcd937x codec reset which usually happens at the end of a playback. This allows to actually output some sound in case of debugging. Cc: Adam Skladowski <a39.skl@gmail.com> Cc: Mohammad Rafi Shaik <quic_mohs@quicinc.com> Cc: Prasad Kumpatla <quic_pkumpatl@quicinc.com> Cc: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> Signed-off-by: Alexey Klimov <alexey.klimov@linaro.org> Link: https://patch.msgid.link/20241022033132.787416-3-alexey.klimov@linaro.org Signed-off-by: Mark Brown <broonie@kernel.org>
2024-10-29ASoC: codecs: wcd937x: add missing LO Switch controlAlexey Klimov
The wcd937x supports also AUX input but the control that sets correct soundwire port for this is missing. This control is required for audio playback, for instance, on qrb4210 RB2 board as well as on other SoCs. Reported-by: Adam Skladowski <a39.skl@gmail.com> Reported-by: Prasad Kumpatla <quic_pkumpatl@quicinc.com> Suggested-by: Adam Skladowski <a39.skl@gmail.com> Suggested-by: Prasad Kumpatla <quic_pkumpatl@quicinc.com> Cc: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> Cc: Mohammad Rafi Shaik <quic_mohs@quicinc.com> Signed-off-by: Alexey Klimov <alexey.klimov@linaro.org> Link: https://patch.msgid.link/20241022033132.787416-2-alexey.klimov@linaro.org Signed-off-by: Mark Brown <broonie@kernel.org>
2024-10-29ASoC: dt-bindings: rockchip,rk3308-codec: add port propertyDmitry Yashin
Fix DTB warnings when rk3308-codec used with audio-graph-card by documenting port property: codec@ff560000: 'port' does not match any of the regexes: 'pinctrl-[0-9]+' Signed-off-by: Dmitry Yashin <dmt.yashin@gmail.com> Reviewed-by: Luca Ceresoli <luca.ceresoli@bootlin.com> Link: https://patch.msgid.link/20241028213314.476776-2-dmt.yashin@gmail.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-10-29Merge branch 'ipv4-convert-rtm_-new-del-addr-and-more-to-per-netns-rtnl'Paolo Abeni
Kuniyuki Iwashima says: ==================== ipv4: Convert RTM_{NEW,DEL}ADDR and more to per-netns RTNL. The IPv4 address hash table and GC are already namespacified. This series converts RTM_NEWADDR/RTM_DELADDR and some more RTNL users to per-netns RTNL. Changes: v2: * Add patch 1 to address sparse warning for CONFIG_DEBUG_NET_SMALL_RTNL=n * Add Eric's tags to patch 2-12 v1: https://lore.kernel.org/netdev/20241018012225.90409-1-kuniyu@amazon.com/ ==================== Link: https://patch.msgid.link/20241021183239.79741-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29ipv4: Convert devinet_ioctl to per-netns RTNL.Kuniyuki Iwashima
ioctl(SIOCGIFCONF) calls dev_ifconf() that operates on the current netns. Let's use per-netns RTNL helpers in dev_ifconf() and inet_gifconf(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29ipv4: Convert devinet_ioctl() to per-netns RTNL except for SIOCSIFFLAGS.Kuniyuki Iwashima
Basically, devinet_ioctl() operates on a single netns. However, ioctl(SIOCSIFFLAGS) will trigger the netdev notifier that could touch another netdev in different netns. Let's use per-netns RTNL helper in devinet_ioctl() and place ASSERT_RTNL() for SIOCSIFFLAGS. We will remove ASSERT_RTNL() once RTM_SETLINK and RTM_DELLINK are converted. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29ipv4: Convert devinet_sysctl_forward() to per-netns RTNL.Kuniyuki Iwashima
devinet_sysctl_forward() touches only a single netns. Let's use rtnl_trylock() and __in_dev_get_rtnl_net(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29rtnetlink: Define rtnl_net_trylock().Kuniyuki Iwashima
We will need the per-netns version of rtnl_trylock(). rtnl_net_trylock() calls __rtnl_net_lock() only when rtnl_trylock() successfully holds RTNL. When RTNL is removed, we will use mutex_trylock() for per-netns RTNL. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29ipv4: Convert check_lifetime() to per-netns RTNL.Kuniyuki Iwashima
Since commit 1675f385213e ("ipv4: Namespacify IPv4 address GC."), check_lifetime() works on a per-netns basis. Let's use rtnl_net_lock() and rtnl_net_dereference(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29ipv4: Convert RTM_DELADDR to per-netns RTNL.Kuniyuki Iwashima
Let's push down RTNL into inet_rtm_deladdr() as rtnl_net_lock(). Now, ip_mc_autojoin_config() is always called under per-netns RTNL, so ASSERT_RTNL() can be replaced with ASSERT_RTNL_NET(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29ipv4: Use per-netns RTNL helpers in inet_rtm_newaddr().Kuniyuki Iwashima
inet_rtm_to_ifa() and find_matching_ifa() are called under rtnl_net_lock(). __in_dev_get_rtnl() and in_dev_for_each_ifa_rtnl() there can use per-netns RTNL helpers. Let's define and use __in_dev_get_rtnl_net() and in_dev_for_each_ifa_rtnl_net(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29ipv4: Convert RTM_NEWADDR to per-netns RTNL.Kuniyuki Iwashima
The address hash table and GC are already namespacified. Let's push down RTNL into inet_rtm_newaddr() as rtnl_net_lock(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29ipv4: Don't allocate ifa for 0.0.0.0 in inet_rtm_newaddr().Kuniyuki Iwashima
When we pass 0.0.0.0 to __inet_insert_ifa(), it frees ifa and returns 0. We can do this check much earlier for RTM_NEWADDR even before allocating struct in_ifaddr. Let's move the validation to 1. inet_insert_ifa() for ioctl() 2. inet_rtm_newaddr() for RTM_NEWADDR Now, we can remove the same check in find_matching_ifa(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29ipv4: Factorise RTM_NEWADDR validation to inet_validate_rtm().Kuniyuki Iwashima
rtm_to_ifaddr() validates some attributes, looks up a netdev, allocates struct in_ifaddr, and validates IFA_CACHEINFO. There is no reason to delay IFA_CACHEINFO validation. We will push RTNL down to inet_rtm_newaddr(), and then we want to complete rtnetlink validation before rtnl_net_lock(). Let's factorise the validation parts. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29rtnetlink: Define RTNL_FLAG_DOIT_PERNET for per-netns RTNL doit().Kuniyuki Iwashima
We will push RTNL down to each doit() as rtnl_net_lock(). We can use RTNL_FLAG_DOIT_UNLOCKED to call doit() without RTNL, but doit() will still hold RTNL. Let's define RTNL_FLAG_DOIT_PERNET as an alias of RTNL_FLAG_DOIT_UNLOCKED. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29rtnetlink: Make per-netns RTNL dereference helpers to macro.Kuniyuki Iwashima
When CONFIG_DEBUG_NET_SMALL_RTNL is off, rtnl_net_dereference() is the static inline wrapper of rtnl_dereference() returning a plain (void *) pointer to make sure net is always evaluated as requested in [0]. But, it makes sparse complain [1] when the pointer has __rcu annotation: net/ipv4/devinet.c:674:47: sparse: warning: incorrect type in argument 2 (different address spaces) net/ipv4/devinet.c:674:47: sparse: expected void *p net/ipv4/devinet.c:674:47: sparse: got struct in_ifaddr [noderef] __rcu * Also, if we evaluate net as (void *) in a macro, then the compiler in turn fails to build due to -Werror=unused-value. #define rtnl_net_dereference(net, p) \ ({ \ (void *)net; \ rtnl_dereference(p); \ }) net/ipv4/devinet.c: In function ‘inet_rtm_deladdr’: ./include/linux/rtnetlink.h:154:17: error: statement with no effect [-Werror=unused-value] 154 | (void *)net; \ net/ipv4/devinet.c:674:21: note: in expansion of macro ‘rtnl_net_dereference’ 674 | (ifa = rtnl_net_dereference(net, *ifap)) != NULL; | ^~~~~~~~~~~~~~~~~~~~ Let's go back to the original simplest macro. Note that checkpatch complains about this approach, but it's one-shot and less noisy than the other two. WARNING: Argument 'net' is not used in function-like macro #76: FILE: include/linux/rtnetlink.h:142: +#define rtnl_net_dereference(net, p) \ + rtnl_dereference(p) Fixes: 844e5e7e656d ("rtnetlink: Add assertion helpers for per-netns RTNL.") Link: https://lore.kernel.org/netdev/20241004132145.7fd208e9@kernel.org/ [0] Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202410200325.SaEJmyZS-lkp@intel.com/ [1] Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>