summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-10-09mptcp: fallback when MPTCP opts are dropped after 1st dataMatthieu Baerts (NGI0)
As reported by Christoph [1], before this patch, an MPTCP connection was wrongly reset when a host received a first data packet with MPTCP options after the 3wHS, but got the next ones without. According to the MPTCP v1 specs [2], a fallback should happen in this case, because the host didn't receive a DATA_ACK from the other peer, nor receive data for more than the initial window which implies a DATA_ACK being received by the other peer. The patch here re-uses the same logic as the one used in other places: by looking at allow_infinite_fallback, which is disabled at the creation of an additional subflow. It's not looking at the first DATA_ACK (or implying one received from the other side) as suggested by the RFC, but it is in continuation with what was already done, which is safer, and it fixes the reported issue. The next step, looking at this first DATA_ACK, is tracked in [4]. This patch has been validated using the following Packetdrill script: 0 socket(..., SOCK_STREAM, IPPROTO_MPTCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 // 3WHS is OK +0.0 < S 0:0(0) win 65535 <mss 1460, sackOK, nop, nop, nop, wscale 6, mpcapable v1 flags[flag_h] nokey> +0.0 > S. 0:0(0) ack 1 <mss 1460, nop, nop, sackOK, nop, wscale 8, mpcapable v1 flags[flag_h] key[skey]> +0.1 < . 1:1(0) ack 1 win 2048 <mpcapable v1 flags[flag_h] key[ckey=2, skey]> +0 accept(3, ..., ...) = 4 // Data from the client with valid MPTCP options (no DATA_ACK: normal) +0.1 < P. 1:501(500) ack 1 win 2048 <mpcapable v1 flags[flag_h] key[skey, ckey] mpcdatalen 500, nop, nop> // From here, the MPTCP options will be dropped by a middlebox +0.0 > . 1:1(0) ack 501 <dss dack8=501 dll=0 nocs> +0.1 read(4, ..., 500) = 500 +0 write(4, ..., 100) = 100 // The server replies with data, still thinking MPTCP is being used +0.0 > P. 1:101(100) ack 501 <dss dack8=501 dsn8=1 ssn=1 dll=100 nocs, nop, nop> // But the client already did a fallback to TCP, because the two previous packets have been received without MPTCP options +0.1 < . 501:501(0) ack 101 win 2048 +0.0 < P. 501:601(100) ack 101 win 2048 // The server should fallback to TCP, not reset: it didn't get a DATA_ACK, nor data for more than the initial window +0.0 > . 101:101(0) ack 601 Note that this script requires Packetdrill with MPTCP support, see [3]. Fixes: dea2b1ea9c70 ("mptcp: do not reset MP_CAPABLE subflow on mapping errors") Cc: stable@vger.kernel.org Reported-by: Christoph Paasch <cpaasch@apple.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/518 [1] Link: https://datatracker.ietf.org/doc/html/rfc8684#name-fallback [2] Link: https://github.com/multipath-tcp/packetdrill [3] Link: https://github.com/multipath-tcp/mptcp_net-next/issues/519 [4] Reviewed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20241008-net-mptcp-fallback-fixes-v1-3-c6fb8e93e551@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-09tcp: fix mptcp DSS corruption due to large pmtu xmitPaolo Abeni
Syzkaller was able to trigger a DSS corruption: TCP: request_sock_subflow_v4: Possible SYN flooding on port [::]:20002. Sending cookies. ------------[ cut here ]------------ WARNING: CPU: 0 PID: 5227 at net/mptcp/protocol.c:695 __mptcp_move_skbs_from_subflow+0x20a9/0x21f0 net/mptcp/protocol.c:695 Modules linked in: CPU: 0 UID: 0 PID: 5227 Comm: syz-executor350 Not tainted 6.11.0-syzkaller-08829-gaf9c191ac2a0 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/06/2024 RIP: 0010:__mptcp_move_skbs_from_subflow+0x20a9/0x21f0 net/mptcp/protocol.c:695 Code: 0f b6 dc 31 ff 89 de e8 b5 dd ea f5 89 d8 48 81 c4 50 01 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc e8 98 da ea f5 90 <0f> 0b 90 e9 47 ff ff ff e8 8a da ea f5 90 0f 0b 90 e9 99 e0 ff ff RSP: 0018:ffffc90000006db8 EFLAGS: 00010246 RAX: ffffffff8ba9df18 RBX: 00000000000055f0 RCX: ffff888030023c00 RDX: 0000000000000100 RSI: 00000000000081e5 RDI: 00000000000055f0 RBP: 1ffff110062bf1ae R08: ffffffff8ba9cf12 R09: 1ffff110062bf1b8 R10: dffffc0000000000 R11: ffffed10062bf1b9 R12: 0000000000000000 R13: dffffc0000000000 R14: 00000000700cec61 R15: 00000000000081e5 FS: 000055556679c380(0000) GS:ffff8880b8600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020287000 CR3: 0000000077892000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <IRQ> move_skbs_to_msk net/mptcp/protocol.c:811 [inline] mptcp_data_ready+0x29c/0xa90 net/mptcp/protocol.c:854 subflow_data_ready+0x34a/0x920 net/mptcp/subflow.c:1490 tcp_data_queue+0x20fd/0x76c0 net/ipv4/tcp_input.c:5283 tcp_rcv_established+0xfba/0x2020 net/ipv4/tcp_input.c:6237 tcp_v4_do_rcv+0x96d/0xc70 net/ipv4/tcp_ipv4.c:1915 tcp_v4_rcv+0x2dc0/0x37f0 net/ipv4/tcp_ipv4.c:2350 ip_protocol_deliver_rcu+0x22e/0x440 net/ipv4/ip_input.c:205 ip_local_deliver_finish+0x341/0x5f0 net/ipv4/ip_input.c:233 NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314 NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314 __netif_receive_skb_one_core net/core/dev.c:5662 [inline] __netif_receive_skb+0x2bf/0x650 net/core/dev.c:5775 process_backlog+0x662/0x15b0 net/core/dev.c:6107 __napi_poll+0xcb/0x490 net/core/dev.c:6771 napi_poll net/core/dev.c:6840 [inline] net_rx_action+0x89b/0x1240 net/core/dev.c:6962 handle_softirqs+0x2c5/0x980 kernel/softirq.c:554 do_softirq+0x11b/0x1e0 kernel/softirq.c:455 </IRQ> <TASK> __local_bh_enable_ip+0x1bb/0x200 kernel/softirq.c:382 local_bh_enable include/linux/bottom_half.h:33 [inline] rcu_read_unlock_bh include/linux/rcupdate.h:919 [inline] __dev_queue_xmit+0x1764/0x3e80 net/core/dev.c:4451 dev_queue_xmit include/linux/netdevice.h:3094 [inline] neigh_hh_output include/net/neighbour.h:526 [inline] neigh_output include/net/neighbour.h:540 [inline] ip_finish_output2+0xd41/0x1390 net/ipv4/ip_output.c:236 ip_local_out net/ipv4/ip_output.c:130 [inline] __ip_queue_xmit+0x118c/0x1b80 net/ipv4/ip_output.c:536 __tcp_transmit_skb+0x2544/0x3b30 net/ipv4/tcp_output.c:1466 tcp_transmit_skb net/ipv4/tcp_output.c:1484 [inline] tcp_mtu_probe net/ipv4/tcp_output.c:2547 [inline] tcp_write_xmit+0x641d/0x6bf0 net/ipv4/tcp_output.c:2752 __tcp_push_pending_frames+0x9b/0x360 net/ipv4/tcp_output.c:3015 tcp_push_pending_frames include/net/tcp.h:2107 [inline] tcp_data_snd_check net/ipv4/tcp_input.c:5714 [inline] tcp_rcv_established+0x1026/0x2020 net/ipv4/tcp_input.c:6239 tcp_v4_do_rcv+0x96d/0xc70 net/ipv4/tcp_ipv4.c:1915 sk_backlog_rcv include/net/sock.h:1113 [inline] __release_sock+0x214/0x350 net/core/sock.c:3072 release_sock+0x61/0x1f0 net/core/sock.c:3626 mptcp_push_release net/mptcp/protocol.c:1486 [inline] __mptcp_push_pending+0x6b5/0x9f0 net/mptcp/protocol.c:1625 mptcp_sendmsg+0x10bb/0x1b10 net/mptcp/protocol.c:1903 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x1a6/0x270 net/socket.c:745 ____sys_sendmsg+0x52a/0x7e0 net/socket.c:2603 ___sys_sendmsg net/socket.c:2657 [inline] __sys_sendmsg+0x2aa/0x390 net/socket.c:2686 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fb06e9317f9 Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffe2cfd4f98 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007fb06e97f468 RCX: 00007fb06e9317f9 RDX: 0000000000000000 RSI: 0000000020000080 RDI: 0000000000000005 RBP: 00007fb06e97f446 R08: 0000555500000000 R09: 0000555500000000 R10: 0000555500000000 R11: 0000000000000246 R12: 00007fb06e97f406 R13: 0000000000000001 R14: 00007ffe2cfd4fe0 R15: 0000000000000003 </TASK> Additionally syzkaller provided a nice reproducer. The repro enables pmtu on the loopback device, leading to tcp_mtu_probe() generating very large probe packets. tcp_can_coalesce_send_queue_head() currently does not check for mptcp-level invariants, and allowed the creation of cross-DSS probes, leading to the mentioned corruption. Address the issue teaching tcp_can_coalesce_send_queue_head() about mptcp using the tcp_skb_can_collapse(), also reducing the code duplication. Fixes: 85712484110d ("tcp: coalesce/collapse must respect MPTCP extensions") Cc: stable@vger.kernel.org Reported-by: syzbot+d1bff73460e33101f0e7@syzkaller.appspotmail.com Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/513 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20241008-net-mptcp-fallback-fixes-v1-2-c6fb8e93e551@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-09mptcp: handle consistently DSS corruptionPaolo Abeni
Bugged peer implementation can send corrupted DSS options, consistently hitting a few warning in the data path. Use DEBUG_NET assertions, to avoid the splat on some builds and handle consistently the error, dumping related MIBs and performing fallback and/or reset according to the subflow type. Fixes: 6771bfd9ee24 ("mptcp: update mptcp ack sequence from work queue") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20241008-net-mptcp-fallback-fixes-v1-1-c6fb8e93e551@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-09net: netconsole: fix wrong warningBreno Leitao
A warning is triggered when there is insufficient space in the buffer for userdata. However, this is not an issue since userdata will be sent in the next iteration. Current warning message: ------------[ cut here ]------------ WARNING: CPU: 13 PID: 3013042 at drivers/net/netconsole.c:1122 write_ext_msg+0x3b6/0x3d0 ? write_ext_msg+0x3b6/0x3d0 console_flush_all+0x1e9/0x330 The code incorrectly issues a warning when this_chunk is zero, which is a valid scenario. The warning should only be triggered when this_chunk is negative. Fixes: 1ec9daf95093 ("net: netconsole: append userdata to fragmented netconsole messages") Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241008094325.896208-1-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-09net: dsa: refuse cross-chip mirroring operationsVladimir Oltean
In case of a tc mirred action from one switch to another, the behavior is not correct. We simply tell the source switch driver to program a mirroring entry towards mirror->to_local_port = to_dp->index, but it is not even guaranteed that the to_dp belongs to the same switch as dp. For proper cross-chip support, we would need to go through the cross-chip notifier layer in switch.c, program the entry on cascade ports, and introduce new, explicit API for cross-chip mirroring, given that intermediary switches should have introspection into the DSA tags passed through the cascade port (and not just program a port mirror on the entire cascade port). None of that exists today. Reject what is not implemented so that user space is not misled into thinking it works. Fixes: f50f212749e8 ("net: dsa: Add plumbing for port mirroring") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20241008094320.3340980-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-09ipv6: Remove redundant unlikely()Tobias Klauser
IS_ERR_OR_NULL() already implies unlikely(). Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241008085454.8087-1-tklauser@distanz.ch Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-09net: fec: don't save PTP state if PTP is unsupportedWei Fang
Some platforms (such as i.MX25 and i.MX27) do not support PTP, so on these platforms fec_ptp_init() is not called and the related members in fep are not initialized. However, fec_ptp_save_state() is called unconditionally, which causes the kernel to panic. Therefore, add a condition so that fec_ptp_save_state() is not called if PTP is not supported. Fixes: a1477dc87dc4 ("net: fec: Restart PPS after link state change") Reported-by: Guenter Roeck <linux@roeck-us.net> Closes: https://lore.kernel.org/lkml/353e41fe-6bb4-4ee9-9980-2da2a9c1c508@roeck-us.net/ Signed-off-by: Wei Fang <wei.fang@nxp.com> Reviewed-by: Csókás, Bence <csokas.bence@prolan.hu> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://patch.msgid.link/20241008061153.1977930-1-wei.fang@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-09ipv6: switch inet6_acaddr_hash() to less predictable hashEric Dumazet
commit 2384d02520ff ("net/ipv6: Add anycast addresses to a global hashtable") added inet6_acaddr_hash(), using ipv6_addr_hash() and net_hash_mix() to get hash spreading for typical users. However ipv6_addr_hash() is highly predictable and a malicious user could abuse a specific hash bucket. Switch to __ipv6_addr_jhash(). We could use a dedicated secret, or reuse net_hash_mix() as I did in this patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20241008121307.800040-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-09ipv6: switch inet6_addr_hash() to less predictable hashEric Dumazet
In commit 3f27fb23219e ("ipv6: addrconf: add per netns perturbation in inet6_addr_hash()"), I added net_hash_mix() in inet6_addr_hash() to get better hash dispersion, at a time all netns were sharing the hash table. Since then, commit 21a216a8fc63 ("ipv6/addrconf: allocate a per netns hash table") made the hash table per netns. We could remove the net_hash_mix() from inet6_addr_hash(), but there is still an issue with ipv6_addr_hash(). It is highly predictable and a malicious user can easily create thousands of IPv6 addresses all stored in the same hash bucket. Switch to __ipv6_addr_jhash(). We could use a dedicated secret, or reuse net_hash_mix() as I did in this patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20241008120101.734521-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-09net: airoha: Fix EGRESS_RATE_METER_EN_MASK definitionLorenzo Bianconi
Fix typo in EGRESS_RATE_METER_EN_MASK mask definition. This bus in not introducing any user visible problem since, even if we are setting EGRESS_RATE_METER_EN_MASK bit in REG_EGRESS_RATE_METER_CFG register, egress QoS metering is not supported yet since we are missing some other hw configurations (e.g token bucket rate, token bucket size). Introduced by commit 23020f049327 ("net: airoha: Introduce ethernet support for EN7581 SoC") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241009-airoha-fixes-v2-1-18af63ec19bf@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-09net: ibm: emac: mal: add dcr_unmap to _removeRosen Penev
It's done in probe so it should be undone here. Fixes: 1d3bb996481e ("Device tree aware EMAC driver") Signed-off-by: Rosen Penev <rosenp@gmail.com> Reviewed-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20241008233050.9422-1-rosenp@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-09net: liquidio: Remove unused cn23xx_dump_pf_initialized_regsDr. David Alan Gilbert
cn23xx_dump_pf_initialized_regs() was added in 2016's commit 72c0091293c0 ("liquidio: CN23XX device init and sriov config") but hasn't been used. Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241009003841.254853-1-linux@treblig.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-09Merge branch 'qca_spi-improvements-to-qca7000-sync'Jakub Kicinski
Stefan Wahren says: ==================== qca_spi: Improvements to QCA7000 sync This series contains patches which improve the QCA7000 sync behavior. ==================== Link: https://patch.msgid.link/20241007113312.38728-1-wahrenst@gmx.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-09qca_spi: Improve reset mechanismStefan Wahren
The commit 92717c2356cb ("net: qca_spi: Avoid high load if QCA7000 is not available") fixed the high load in case the QCA7000 is not available but introduced sync delays for some corner cases like buffer errors. So add the reset requests to the atomics flags, which are polled by the SPI thread. As a result reset requests and sync state are now separated. This has the nice benefit to make the code easier to understand. Signed-off-by: Stefan Wahren <wahrenst@gmx.net> Link: https://patch.msgid.link/20241007113312.38728-3-wahrenst@gmx.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-09qca_spi: Count unexpected WRBUF_SPC_AVA after resetStefan Wahren
After a reset of the QCA7000, the amount of available write buffer space should match QCASPI_HW_BUF_LEN. If this is not the case this error should be counted as such. Signed-off-by: Stefan Wahren <wahrenst@gmx.net> Link: https://patch.msgid.link/20241007113312.38728-2-wahrenst@gmx.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-09tcp: remove unnecessary update for tp->write_seq in tcp_connect()xin.guo
Commit 783237e8daf13 ("net-tcp: Fast Open client - sending SYN-data") introduces tcp_connect_queue_skb() and it would overwrite tcp->write_seq, so it is no need to update tp->write_seq before invoking tcp_connect_queue_skb(). Signed-off-by: xin.guo <guoxin0309@gmail.com> Link: https://patch.msgid.link/1728289544-4611-1-git-send-email-guoxin0309@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-09net: ftgmac100: fixed not check status from fixed phyJacky Chou
Add error handling from calling fixed_phy_register. It may return some error, therefore, need to check the status. And fixed_phy_register needs to bind a device node for mdio. Add the mac device node for fixed_phy_register function. This is a reference to this function, of_phy_register_fixed_link(). Fixes: e24a6c874601 ("net: ftgmac100: Get link speed and duplex for NC-SI") Signed-off-by: Jacky Chou <jacky_chou@aspeedtech.com> Link: https://patch.msgid.link/20241007032435.787892-1-jacky_chou@aspeedtech.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-09doc: net: Fix .rst rendering of net_cachelines pagesDonald Hunter
The doc pages under /networking/net_cachelines are unreadable because they lack .rst formatting for the tabular text. Add simple table markup and tidy up the table contents: - remove dashes that represent empty cells because they render as bullets and are not needed - replace 'struct_*' with 'struct *' in the first column so that sphinx can render links for any structs that appear in the docs Signed-off-by: Donald Hunter <donald.hunter@gmail.com> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241008165329.45647-1-donald.hunter@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-09Merge branch 'ipv4-convert-__fib_validate_source-and-its-callers-to-dscp_t'Jakub Kicinski
Guillaume Nault says: ==================== ipv4: Convert __fib_validate_source() and its callers to dscp_t. This patch series continues to prepare users of ->flowi4_tos to a future conversion of this field (__u8 to dscp_t). This time, we convert __fib_validate_source() and its call chain. The objective is to eventually make all users of ->flowi4_tos use a dscp_t value. Making ->flowi4_tos a dscp_t field will help avoiding regressions where ECN bits are erroneously interpreted as DSCP bits. ==================== Link: https://patch.msgid.link/cover.1728302212.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-09ipv4: Convert __fib_validate_source() to dscp_t.Guillaume Nault
Pass a dscp_t variable to __fib_validate_source(), instead of a plain u8, to prevent accidental setting of ECN bits in ->flowi4_tos. Only fib_validate_source() actually calls __fib_validate_source(). Since it already has a dscp_t variable to pass as parameter, we only need to remove the inet_dscp_to_dsfield() conversion. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/8206b0a64a21a208ed94774e261a251c8d7bc251.1728302212.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-09ipv4: Convert fib_validate_source() to dscp_t.Guillaume Nault
Pass a dscp_t variable to fib_validate_source(), instead of a plain u8, to prevent accidental setting of ECN bits in ->flowi4_tos. All callers of fib_validate_source() already have a dscp_t variable to pass as parameter. We just need to remove the inet_dscp_to_dsfield() conversions. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/08612a4519bc5a3578bb493fbaad82437ebb73dc.1728302212.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-09ipv4: Convert ip_mc_validate_source() to dscp_t.Guillaume Nault
Pass a dscp_t variable to ip_mc_validate_source(), instead of a plain u8, to prevent accidental setting of ECN bits in ->flowi4_tos. Callers of ip_mc_validate_source() to consider are: * ip_route_input_mc() which already has a dscp_t variable to pass as parameter. We just need to remove the inet_dscp_to_dsfield() conversion. * udp_v4_early_demux() which gets the DSCP directly from the IPv4 header and can simply use the ip4h_dscp() helper. Also, stop including net/inet_dscp.h in udp.c as we don't use any of its declarations anymore. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/c91b2cca04718b7ee6cf5b9c1d5b40507d65a8d4.1728302212.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-09ipv4: Convert ip_route_input_mc() to dscp_t.Guillaume Nault
Pass a dscp_t variable to ip_route_input_mc(), instead of a plain u8, to prevent accidental setting of ECN bits in ->flowi4_tos. Only ip_route_input_rcu() actually calls ip_route_input_mc(). Since it already has a dscp_t variable to pass as parameter, we only need to remove the inet_dscp_to_dsfield() conversion. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/0cc653ef59bbc0a28881f706d34896c61eba9e01.1728302212.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-09ipv4: Convert __mkroute_input() to dscp_t.Guillaume Nault
Pass a dscp_t variable to __mkroute_input(), instead of a plain u8, to prevent accidental setting of ECN bits in ->flowi4_tos. Only ip_mkroute_input() actually calls __mkroute_input(). Since it already has a dscp_t variable to pass as parameter, we only need to remove the inet_dscp_to_dsfield() conversion. While there, reorganise the function parameters to fill up horizontal space. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/40853c720aee4d608e6b1b204982164c3b76697d.1728302212.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-09ipv4: Convert ip_mkroute_input() to dscp_t.Guillaume Nault
Pass a dscp_t variable to ip_mkroute_input(), instead of a plain u8, to prevent accidental setting of ECN bits in ->flowi4_tos. Only ip_route_input_slow() actually calls ip_mkroute_input(). Since it already has a dscp_t variable to pass as parameter, we only need to remove the inet_dscp_to_dsfield() conversion. While there, reorganise the function parameters to fill up horizontal space. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/6aa71e28f9ff681cbd70847080e1ab6b526f94f1.1728302212.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-09ipv4: Convert ip_route_use_hint() to dscp_t.Guillaume Nault
Pass a dscp_t variable to ip_route_use_hint(), instead of a plain u8, to prevent accidental setting of ECN bits in ->flowi4_tos. Only ip_rcv_finish_core() actually calls ip_route_use_hint(). Use the ip4h_dscp() helper to get the DSCP from the IPv4 header. While there, modify the declaration of ip_route_use_hint() in include/net/route.h so that it matches the prototype of its implementation in net/ipv4/route.c. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/c40994fdf804db7a363d04fdee01bf48dddda676.1728302212.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-09Merge tag 'mm-hotfixes-stable-2024-10-09-15-46' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "12 hotfixes, 5 of which are c:stable. All singletons, about half of which are MM" * tag 'mm-hotfixes-stable-2024-10-09-15-46' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: zswap: delete comments for "value" member of 'struct zswap_entry'. CREDITS: sort alphabetically by name secretmem: disable memfd_secret() if arch cannot set direct map .mailmap: update Fangrui's email mm/huge_memory: check pmd_special() only after pmd_present() resource, kunit: fix user-after-free in resource_test_region_intersects() fs/proc/kcore.c: allow translation of physical memory addresses selftests/mm: fix incorrect buffer->mirror size in hmm2 double_map test device-dax: correct pgoff align in dax_set_mapping() kthread: unpark only parked kthread Revert "mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN" bcachefs: do not use PF_MEMALLOC_NORECLAIM
2024-10-09selftests: netfilter: conntrack_vrf.sh: add fib test caseFlorian Westphal
meta iifname veth0 ip daddr ... fib daddr oif ... is expected to return "dummy0" interface which is part of same vrf as veth0. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-10-09netfilter: fib: check correct rtable in vrf setupsFlorian Westphal
We need to init l3mdev unconditionally, else main routing table is searched and incorrect result is returned unless strict (iif keyword) matching is requested. Next patch adds a selftest for this. Fixes: 2a8a7c0eaa87 ("netfilter: nft_fib: Fix for rpath check with VRF devices") Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1761 Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-10-09netfilter: xtables: avoid NFPROTO_UNSPEC where neededFlorian Westphal
syzbot managed to call xt_cluster match via ebtables: WARNING: CPU: 0 PID: 11 at net/netfilter/xt_cluster.c:72 xt_cluster_mt+0x196/0x780 [..] ebt_do_table+0x174b/0x2a40 Module registers to NFPROTO_UNSPEC, but it assumes ipv4/ipv6 packet processing. As this is only useful to restrict locally terminating TCP/UDP traffic, register this for ipv4 and ipv6 family only. Pablo points out that this is a general issue, direct users of the set/getsockopt interface can call into targets/matches that were only intended for use with ip(6)tables. Check all UNSPEC matches and targets for similar issues: - matches and targets are fine except if they assume skb_network_header() is valid -- this is only true when called from inet layer: ip(6) stack pulls the ip/ipv6 header into linear data area. - targets that return XT_CONTINUE or other xtables verdicts must be restricted too, they are incompatbile with the ebtables traverser, e.g. EBT_CONTINUE is a completely different value than XT_CONTINUE. Most matches/targets are changed to register for NFPROTO_IPV4/IPV6, as they are provided for use by ip(6)tables. The MARK target is also used by arptables, so register for NFPROTO_ARP too. While at it, bail out if connbytes fails to enable the corresponding conntrack family. This change passes the selftests in iptables.git. Reported-by: syzbot+256c348558aa5cf611a9@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netfilter-devel/66fec2e2.050a0220.9ec68.0047.GAE@google.com/ Fixes: 0269ea493734 ("netfilter: xtables: add cluster match") Signed-off-by: Florian Westphal <fw@strlen.de> Co-developed-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-10-09mm: zswap: delete comments for "value" member of 'struct zswap_entry'.Kanchana P Sridhar
Made a minor edit in the comments for 'struct zswap_entry' to delete the description of the 'value' member that was deleted in commit 20a5532ffa53 ("mm: remove code to handle same filled pages"). Link: https://lkml.kernel.org/r/20241002194213.30041-1-kanchana.p.sridhar@intel.com Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Fixes: 20a5532ffa53 ("mm: remove code to handle same filled pages") Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Usama Arif <usamaarif642@gmail.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Huang Ying <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Wajdi Feghali <wajdi.k.feghali@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09CREDITS: sort alphabetically by nameKrzysztof Kozlowski
Re-sort few misplaced entries in the CREDITS file. Link: https://lkml.kernel.org/r/20241002111932.46012-1-krzysztof.kozlowski@linaro.org Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09secretmem: disable memfd_secret() if arch cannot set direct mapPatrick Roy
Return -ENOSYS from memfd_secret() syscall if !can_set_direct_map(). This is the case for example on some arm64 configurations, where marking 4k PTEs in the direct map not present can only be done if the direct map is set up at 4k granularity in the first place (as ARM's break-before-make semantics do not easily allow breaking apart large/gigantic pages). More precisely, on arm64 systems with !can_set_direct_map(), set_direct_map_invalid_noflush() is a no-op, however it returns success (0) instead of an error. This means that memfd_secret will seemingly "work" (e.g. syscall succeeds, you can mmap the fd and fault in pages), but it does not actually achieve its goal of removing its memory from the direct map. Note that with this patch, memfd_secret() will start erroring on systems where can_set_direct_map() returns false (arm64 with CONFIG_RODATA_FULL_DEFAULT_ENABLED=n, CONFIG_DEBUG_PAGEALLOC=n and CONFIG_KFENCE=n), but that still seems better than the current silent failure. Since CONFIG_RODATA_FULL_DEFAULT_ENABLED defaults to 'y', most arm64 systems actually have a working memfd_secret() and aren't be affected. From going through the iterations of the original memfd_secret patch series, it seems that disabling the syscall in these scenarios was the intended behavior [1] (preferred over having set_direct_map_invalid_noflush return an error as that would result in SIGBUSes at page-fault time), however the check for it got dropped between v16 [2] and v17 [3], when secretmem moved away from CMA allocations. [1]: https://lore.kernel.org/lkml/20201124164930.GK8537@kernel.org/ [2]: https://lore.kernel.org/lkml/20210121122723.3446-11-rppt@kernel.org/#t [3]: https://lore.kernel.org/lkml/20201125092208.12544-10-rppt@kernel.org/ Link: https://lkml.kernel.org/r/20241001080056.784735-1-roypat@amazon.co.uk Fixes: 1507f51255c9 ("mm: introduce memfd_secret system call to create "secret" memory areas") Signed-off-by: Patrick Roy <roypat@amazon.co.uk> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Gowans <jgowans@amazon.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09.mailmap: update Fangrui's emailFangrui Song
I'm leaving Google. Link: https://lkml.kernel.org/r/20240927192912.31532-1-i@maskray.me Signed-off-by: Fangrui Song <i@maskray.me> Acked-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09mm/huge_memory: check pmd_special() only after pmd_present()David Hildenbrand
We should only check for pmd_special() after we made sure that we have a present PMD. For example, if we have a migration PMD, pmd_special() might indicate that we have a special PMD although we really don't. This fixes confusing migration entries as PFN mappings, and not doing what we are supposed to do in the "is_swap_pmd()" case further down in the function -- including messing up COW, page table handling and accounting. Link: https://lkml.kernel.org/r/20240926154234.2247217-1-david@redhat.com Fixes: bc02afbd4d73 ("mm/fork: accept huge pfnmap entries") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: syzbot+bf2c35fa302ebe3c7471@syzkaller.appspotmail.com Closes: https://lore.kernel.org/lkml/66f15c8d.050a0220.c23dd.000f.GAE@google.com/ Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09resource, kunit: fix user-after-free in resource_test_region_intersects()Huang Ying
In resource_test_insert_resource(), the pointer is used in error message after kfree(). This is user-after-free. To fix this, we need to call kunit_add_action_or_reset() to schedule memory freeing after usage. But kunit_add_action_or_reset() itself may fail and free the memory. So, its return value should be checked and abort the test for failure. Then, we found that other usage of kunit_add_action_or_reset() in resource_test_region_intersects() needs to be fixed too. We fix all these user-after-free bugs in this patch. Link: https://lkml.kernel.org/r/20240930070611.353338-1-ying.huang@intel.com Fixes: 99185c10d5d9 ("resource, kunit: add test case for region_intersects()") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reported-by: Kees Bakker <kees@ijzerbout.nl> Closes: https://lore.kernel.org/lkml/87ldzaotcg.fsf@yhuang6-desk2.ccr.corp.intel.com/ Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09fs/proc/kcore.c: allow translation of physical memory addressesAlexander Gordeev
When /proc/kcore is read an attempt to read the first two pages results in HW-specific page swap on s390 and another (so called prefix) pages are accessed instead. That leads to a wrong read. Allow architecture-specific translation of memory addresses using kc_xlate_dev_mem_ptr() and kc_unxlate_dev_mem_ptr() callbacks similarily to /dev/mem xlate_dev_mem_ptr() and unxlate_dev_mem_ptr() callbacks. That way an architecture can deal with specific physical memory ranges. Re-use the existing /dev/mem callback implementation on s390, which handles the described prefix pages swapping correctly. For other architectures the default callback is basically NOP. It is expected the condition (vaddr == __va(__pa(vaddr))) always holds true for KCORE_RAM memory type. Link: https://lkml.kernel.org/r/20240930122119.1651546-1-agordeev@linux.ibm.com Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Suggested-by: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09selftests/mm: fix incorrect buffer->mirror size in hmm2 double_map testDonet Tom
The hmm2 double_map test was failing due to an incorrect buffer->mirror size. The buffer->mirror size was 6, while buffer->ptr size was 6 * PAGE_SIZE. The test failed because the kernel's copy_to_user function was attempting to copy a 6 * PAGE_SIZE buffer to buffer->mirror. Since the size of buffer->mirror was incorrect, copy_to_user failed. This patch corrects the buffer->mirror size to 6 * PAGE_SIZE. Test Result without this patch ============================== # RUN hmm2.hmm2_device_private.double_map ... # hmm-tests.c:1680:double_map:Expected ret (-14) == 0 (0) # double_map: Test terminated by assertion # FAIL hmm2.hmm2_device_private.double_map not ok 53 hmm2.hmm2_device_private.double_map Test Result with this patch =========================== # RUN hmm2.hmm2_device_private.double_map ... # OK hmm2.hmm2_device_private.double_map ok 53 hmm2.hmm2_device_private.double_map Link: https://lkml.kernel.org/r/20240927050752.51066-1-donettom@linux.ibm.com Fixes: fee9f6d1b8df ("mm/hmm/test: add selftests for HMM") Signed-off-by: Donet Tom <donettom@linux.ibm.com> Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Brown <broonie@kernel.org> Cc: Przemek Kitszel <przemyslaw.kitszel@intel.com> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09device-dax: correct pgoff align in dax_set_mapping()Kun(llfl)
pgoff should be aligned using ALIGN_DOWN() instead of ALIGN(). Otherwise, vmf->address not aligned to fault_size will be aligned to the next alignment, that can result in memory failure getting the wrong address. It's a subtle situation that only can be observed in page_mapped_in_vma() after the page is page fault handled by dev_dax_huge_fault. Generally, there is little chance to perform page_mapped_in_vma in dev-dax's page unless in specific error injection to the dax device to trigger an MCE - memory-failure. In that case, page_mapped_in_vma() will be triggered to determine which task is accessing the failure address and kill that task in the end. We used self-developed dax device (which is 2M aligned mapping) , to perform error injection to random address. It turned out that error injected to non-2M-aligned address was causing endless MCE until panic. Because page_mapped_in_vma() kept resulting wrong address and the task accessing the failure address was never killed properly: [ 3783.719419] Memory failure: 0x200c9742: recovery action for dax page: Recovered [ 3784.049006] mce: Uncorrected hardware memory error in user-access at 200c9742380 [ 3784.049190] Memory failure: 0x200c9742: recovery action for dax page: Recovered [ 3784.448042] mce: Uncorrected hardware memory error in user-access at 200c9742380 [ 3784.448186] Memory failure: 0x200c9742: recovery action for dax page: Recovered [ 3784.792026] mce: Uncorrected hardware memory error in user-access at 200c9742380 [ 3784.792179] Memory failure: 0x200c9742: recovery action for dax page: Recovered [ 3785.162502] mce: Uncorrected hardware memory error in user-access at 200c9742380 [ 3785.162633] Memory failure: 0x200c9742: recovery action for dax page: Recovered [ 3785.461116] mce: Uncorrected hardware memory error in user-access at 200c9742380 [ 3785.461247] Memory failure: 0x200c9742: recovery action for dax page: Recovered [ 3785.764730] mce: Uncorrected hardware memory error in user-access at 200c9742380 [ 3785.764859] Memory failure: 0x200c9742: recovery action for dax page: Recovered [ 3786.042128] mce: Uncorrected hardware memory error in user-access at 200c9742380 [ 3786.042259] Memory failure: 0x200c9742: recovery action for dax page: Recovered [ 3786.464293] mce: Uncorrected hardware memory error in user-access at 200c9742380 [ 3786.464423] Memory failure: 0x200c9742: recovery action for dax page: Recovered [ 3786.818090] mce: Uncorrected hardware memory error in user-access at 200c9742380 [ 3786.818217] Memory failure: 0x200c9742: recovery action for dax page: Recovered [ 3787.085297] mce: Uncorrected hardware memory error in user-access at 200c9742380 [ 3787.085424] Memory failure: 0x200c9742: recovery action for dax page: Recovered It took us several weeks to pinpoint this problem,  but we eventually used bpftrace to trace the page fault and mce address and successfully identified the issue. Joao added: ; Likely we never reproduce in production because we always pin : device-dax regions in the region align they provide (Qemu does : similarly with prealloc in hugetlb/file backed memory). I think this : bug requires that we touch *unpinned* device-dax regions unaligned to : the device-dax selected alignment (page size i.e. 4K/2M/1G) Link: https://lkml.kernel.org/r/23c02a03e8d666fef11bbe13e85c69c8b4ca0624.1727421694.git.llfl@linux.alibaba.com Fixes: b9b5777f09be ("device-dax: use ALIGN() for determining pgoff") Signed-off-by: Kun(llfl) <llfl@linux.alibaba.com> Tested-by: JianXiong Zhao <zhaojianxiong.zjx@alibaba-inc.com> Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09kthread: unpark only parked kthreadFrederic Weisbecker
Calling into kthread unparking unconditionally is mostly harmless when the kthread is already unparked. The wake up is then simply ignored because the target is not in TASK_PARKED state. However if the kthread is per CPU, the wake up is preceded by a call to kthread_bind() which expects the task to be inactive and in TASK_PARKED state, which obviously isn't the case if it is unparked. As a result, calling kthread_stop() on an unparked per-cpu kthread triggers such a warning: WARNING: CPU: 0 PID: 11 at kernel/kthread.c:525 __kthread_bind_mask kernel/kthread.c:525 <TASK> kthread_stop+0x17a/0x630 kernel/kthread.c:707 destroy_workqueue+0x136/0xc40 kernel/workqueue.c:5810 wg_destruct+0x1e2/0x2e0 drivers/net/wireguard/device.c:257 netdev_run_todo+0xe1a/0x1000 net/core/dev.c:10693 default_device_exit_batch+0xa14/0xa90 net/core/dev.c:11769 ops_exit_list net/core/net_namespace.c:178 [inline] cleanup_net+0x89d/0xcc0 net/core/net_namespace.c:640 process_one_work kernel/workqueue.c:3231 [inline] process_scheduled_works+0xa2c/0x1830 kernel/workqueue.c:3312 worker_thread+0x86d/0xd70 kernel/workqueue.c:3393 kthread+0x2f0/0x390 kernel/kthread.c:389 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 </TASK> Fix this with skipping unecessary unparking while stopping a kthread. Link: https://lkml.kernel.org/r/20240913214634.12557-1-frederic@kernel.org Fixes: 5c25b5ff89f0 ("workqueue: Tag bound workers with KTHREAD_IS_PER_CPU") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reported-by: syzbot+943d34fa3cf2191e3068@syzkaller.appspotmail.com Tested-by: syzbot+943d34fa3cf2191e3068@syzkaller.appspotmail.com Suggested-by: Thomas Gleixner <tglx@linutronix.de> Cc: Hillf Danton <hdanton@sina.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09Revert "mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"Michal Hocko
This reverts commit eab0af905bfc3e9c05da2ca163d76a1513159aa4. There is no existing user of those flags. PF_MEMALLOC_NOWARN is dangerous because a nested allocation context can use GFP_NOFAIL which could cause unexpected failure. Such a code would be hard to maintain because it could be deeper in the call chain. PF_MEMALLOC_NORECLAIM has been added even when it was pointed out [1] that such a allocation contex is inherently unsafe if the context doesn't fully control all allocations called from this context. While PF_MEMALLOC_NOWARN is not dangerous the way PF_MEMALLOC_NORECLAIM is it doesn't have any user and as Matthew has pointed out we are running out of those flags so better reclaim it without any real users. [1] https://lore.kernel.org/all/ZcM0xtlKbAOFjv5n@tiehlicka/ Link: https://lkml.kernel.org/r/20240926172940.167084-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: James Morris <jmorris@namei.org> Cc: Jan Kara <jack@suse.cz> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Paul Moore <paul@paul-moore.com> Cc: Serge E. Hallyn <serge@hallyn.com> Cc: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09bcachefs: do not use PF_MEMALLOC_NORECLAIMMichal Hocko
Patch series "remove PF_MEMALLOC_NORECLAIM" v3. This patch (of 2): bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new inode to achieve GFP_NOWAIT semantic while holding locks. If this allocation fails it will drop locks and use GFP_NOFS allocation context. We would like to drop PF_MEMALLOC_NORECLAIM because it is really dangerous to use if the caller doesn't control the full call chain with this flag set. E.g. if any of the function down the chain needed GFP_NOFAIL request the PF_MEMALLOC_NORECLAIM would override this and cause unexpected failure. While this is not the case in this particular case using the scoped gfp semantic is not really needed bacause we can easily pus the allocation context down the chain without too much clutter. [akpm@linux-foundation.org: fix kerneldoc warnings] Link: https://lkml.kernel.org/r/20240926172940.167084-1-mhocko@kernel.org Link: https://lkml.kernel.org/r/20240926172940.167084-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> # For vfs changes Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: James Morris <jmorris@namei.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Paul Moore <paul@paul-moore.com> Cc: Serge E. Hallyn <serge@hallyn.com> Cc: Yafang Shao <laoar.shao@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09misc: sgi-gru: Don't disable preemption in GRU driverDimitri Sivanich
Disabling preemption in the GRU driver is unnecessary, and clashes with sleeping locks in several code paths. Remove preempt_disable and preempt_enable from the GRU driver. Signed-off-by: Dimitri Sivanich <sivanich@hpe.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-10-09Merge tag 'unicode-fixes-6.12-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode Pull unicode fix from Gabriel Krisman Bertazi: - Handle code-points with the Ignorable property as regular character instead of treating them as an empty string (me) * tag 'unicode-fixes-6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode: unicode: Don't special case ignorable code points
2024-10-09unicode: Don't special case ignorable code pointsGabriel Krisman Bertazi
We don't need to handle them separately. Instead, just let them decompose/casefold to themselves. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
2024-10-09ring-buffer: Do not have boot mapped buffers hook to CPU hotplugSteven Rostedt
The boot mapped ring buffer has its buffer mapped at a fixed location found at boot up. It is not dynamic. It cannot grow or be expanded when new CPUs come online. Do not hook fixed memory mapped ring buffers to the CPU hotplug callback, otherwise it can cause a crash when it tries to add the buffer to the memory that is already fully occupied. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20241008143242.25e20801@gandalf.local.home Fixes: be68d63a139bd ("ring-buffer: Add ring_buffer_alloc_range()") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-09net: mana: Enable debugfs files for MANA deviceShradha Gupta
Implement debugfs in MANA driver to be able to view RX,TX,EQ queue specific attributes and dump their gdma queues. These dumps can be used by other userspace utilities to improve debuggability and troubleshooting Following files are added in debugfs: /sys/kernel/debug/mana/ |-------------- 1 |--------------- EQs | |------- eq0 | | |---head | | |---tail | | |---eq_dump | |------- eq1 | . | . | |--------------- adapter-MTU |--------------- vport0 |------- RX-0 | |---cq_budget | |---cq_dump | |---cq_head | |---cq_tail | |---rq_head | |---rq_nbuf | |---rq_tail | |---rxq_dump |------- RX-1 . . |------- TX-0 | |---cq_budget | |---cq_dump | |---cq_head | |---cq_tail | |---sq_head | |---sq_pend_skb_qlen | |---sq_tail | |---txq_dump |------- TX-1 . . Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-10-09net: hns3/hns: Update the maintainer for the HNS3/HNS ethernet driverJijie Shao
Yisen Zhuang has left the company in September. Jian Shen will be responsible for maintaining the hns3/hns driver's code in the future, so add Jian Shen to the hns3/hns driver's matainer list. Signed-off-by: Jijie Shao <shaojijie@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-10-09r8169: add support for the temperature sensor being available from RTL8125BHeiner Kallweit
This adds support for the temperature sensor being available from RTL8125B. Register information was taken from r8125 vendor driver. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-10-09sctp: ensure sk_state is set to CLOSED if hashing fails in sctp_listen_startXin Long
If hashing fails in sctp_listen_start(), the socket remains in the LISTENING state, even though it was not added to the hash table. This can lead to a scenario where a socket appears to be listening without actually being accessible. This patch ensures that if the hashing operation fails, the sk_state is set back to CLOSED before returning an error. Note that there is no need to undo the autobind operation if hashing fails, as the bind port can still be used for next listen() call on the same socket. Fixes: 76c6d988aeb3 ("sctp: add sock_reuseport for the sock in __sctp_hash_endpoint") Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>