summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-10-29Merge branch 'intel-wired-lan-driver-fixes-2024-10-21-igb-ice'Paolo Abeni
Jacob Keller says: ==================== Intel Wired LAN Driver Fixes 2024-10-21 (igb, ice) This series includes fixes for the ice and igb drivers. Wander fixes an issue in igb when operating on PREEMPT_RT kernels due to the PREEMPT_RT kernel switching IRQs to be threaded by default. Michal fixes the ice driver to block subfunction port creation when the PF is operating in legacy (non-switchdev) mode. Arkadiusz fixes a crash when loading the ice driver on an E810 LOM which has DPLL enabled. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> ==================== Link: https://patch.msgid.link/20241021-iwl-2024-10-21-iwl-net-fixes-v1-0-a50cb3059f55@intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29ice: fix crash on probe for DPLL enabled E810 LOMArkadiusz Kubalewski
The E810 Lan On Motherboard (LOM) design is vendor specific. Intel provides the reference design, but it is up to vendor on the final product design. For some cases, like Linux DPLL support, the static values defined in the driver does not reflect the actual LOM design. Current implementation of dpll pins is causing the crash on probe of the ice driver for such DPLL enabled E810 LOM designs: WARNING: (...) at drivers/dpll/dpll_core.c:495 dpll_pin_get+0x2c4/0x330 ... Call Trace: <TASK> ? __warn+0x83/0x130 ? dpll_pin_get+0x2c4/0x330 ? report_bug+0x1b7/0x1d0 ? handle_bug+0x42/0x70 ? exc_invalid_op+0x18/0x70 ? asm_exc_invalid_op+0x1a/0x20 ? dpll_pin_get+0x117/0x330 ? dpll_pin_get+0x2c4/0x330 ? dpll_pin_get+0x117/0x330 ice_dpll_get_pins.isra.0+0x52/0xe0 [ice] ... The number of dpll pins enabled by LOM vendor is greater than expected and defined in the driver for Intel designed NICs, which causes the crash. Prevent the crash and allow generic pin initialization within Linux DPLL subsystem for DPLL enabled E810 LOM designs. Newly designed solution for described issue will be based on "per HW design" pin initialization. It requires pin information dynamically acquired from the firmware and is already in progress, planned for next-tree only. Fixes: d7999f5ea64b ("ice: implement dpll interface to control cgu") Reviewed-by: Karol Kolacinski <karol.kolacinski@intel.com> Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29ice: block SF port creation in legacy modeMichal Swiatkowski
There is no support for SF in legacy mode. Reflect it in the code. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Fixes: eda69d654c7e ("ice: add basic devlink subfunctions support") Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29igb: Disable threaded IRQ for igb_msix_otherWander Lairson Costa
During testing of SR-IOV, Red Hat QE encountered an issue where the ip link up command intermittently fails for the igbvf interfaces when using the PREEMPT_RT variant. Investigation revealed that e1000_write_posted_mbx returns an error due to the lack of an ACK from e1000_poll_for_ack. The underlying issue arises from the fact that IRQs are threaded by default under PREEMPT_RT. While the exact hardware details are not available, it appears that the IRQ handled by igb_msix_other must be processed before e1000_poll_for_ack times out. However, e1000_write_posted_mbx is called with preemption disabled, leading to a scenario where the IRQ is serviced only after the failure of e1000_write_posted_mbx. To resolve this, we set IRQF_NO_THREAD for the affected interrupt, ensuring that the kernel handles it immediately, thereby preventing the aforementioned error. Reproducer: #!/bin/bash # echo 2 > /sys/class/net/ens14f0/device/sriov_numvfs ipaddr_vlan=3 nic_test=ens14f0 vf=${nic_test}v0 while true; do ip link set ${nic_test} mtu 1500 ip link set ${vf} mtu 1500 ip link set $vf up ip link set ${nic_test} vf 0 vlan ${ipaddr_vlan} ip addr add 172.30.${ipaddr_vlan}.1/24 dev ${vf} ip addr add 2021:db8:${ipaddr_vlan}::1/64 dev ${vf} if ! ip link show $vf | grep 'state UP'; then echo 'Error found' break fi ip link set $vf down done Signed-off-by: Wander Lairson Costa <wander@redhat.com> Fixes: 9d5c824399de ("igb: PCI-Express 82575 Gigabit Ethernet driver") Reported-by: Yuying Ma <yuma@redhat.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29ASoC: codecs: wcd937x: relax the AUX PDM watchdogAlexey Klimov
On a system with wcd937x, rxmacro and Qualcomm audio DSP, which is pretty common set of devices on Qualcomm platforms, and due to the order of how DAPM widgets are powered on (they are sorted), there is a small time window when wcd937x chip is online and expects the flow of incoming data but rxmacro is not yet online. When wcd937x is programmed to receive data via AUX port then its AUX PDM watchdog is enabled in wcd937x_codec_enable_aux_pa(). If due to some reasons the rxmacro and soundwire machinery are delayed to start streaming data, then there is a chance for this AUX PDM watchdog to reset the wcd937x codec. Such event is not logged as a message and only wcd937x IRQ counter is increased however there could be a lot of other reasons for that IRQ. There is a similar opportunity for such delay during DAPM widgets power down sequence. If wcd937x codec reset happens on the start of the playback, then there will be no sound and if such reset happens at the end of a playback then it may generate additional clicks and pops noises. On qrb4210 RB2 board without any debugging bits the wcd937x resets are sometimes observed at the end of a playback though not always. With some debugging messages or with some tracing enabled the AUX PDM watchdog resets the wcd937x codec at the start of a playback and there is no sound output at all. In this patch: - TIMEOUT_SEL bit in PDM_WD_CTL2 register is set to increase the watchdog reset delay to 100ms which eliminates the AUX PDM watchdog IRQs on qrb4210 RB2 board completely and decreases the number of unwanted clicks noises; - HOLD_OFF bit postpones triggering such watchdog IRQ till wcd937x codec reset which usually happens at the end of a playback. This allows to actually output some sound in case of debugging. Cc: Adam Skladowski <a39.skl@gmail.com> Cc: Mohammad Rafi Shaik <quic_mohs@quicinc.com> Cc: Prasad Kumpatla <quic_pkumpatl@quicinc.com> Cc: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> Signed-off-by: Alexey Klimov <alexey.klimov@linaro.org> Link: https://patch.msgid.link/20241022033132.787416-3-alexey.klimov@linaro.org Signed-off-by: Mark Brown <broonie@kernel.org>
2024-10-29ASoC: codecs: wcd937x: add missing LO Switch controlAlexey Klimov
The wcd937x supports also AUX input but the control that sets correct soundwire port for this is missing. This control is required for audio playback, for instance, on qrb4210 RB2 board as well as on other SoCs. Reported-by: Adam Skladowski <a39.skl@gmail.com> Reported-by: Prasad Kumpatla <quic_pkumpatl@quicinc.com> Suggested-by: Adam Skladowski <a39.skl@gmail.com> Suggested-by: Prasad Kumpatla <quic_pkumpatl@quicinc.com> Cc: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> Cc: Mohammad Rafi Shaik <quic_mohs@quicinc.com> Signed-off-by: Alexey Klimov <alexey.klimov@linaro.org> Link: https://patch.msgid.link/20241022033132.787416-2-alexey.klimov@linaro.org Signed-off-by: Mark Brown <broonie@kernel.org>
2024-10-29ASoC: dt-bindings: rockchip,rk3308-codec: add port propertyDmitry Yashin
Fix DTB warnings when rk3308-codec used with audio-graph-card by documenting port property: codec@ff560000: 'port' does not match any of the regexes: 'pinctrl-[0-9]+' Signed-off-by: Dmitry Yashin <dmt.yashin@gmail.com> Reviewed-by: Luca Ceresoli <luca.ceresoli@bootlin.com> Link: https://patch.msgid.link/20241028213314.476776-2-dmt.yashin@gmail.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-10-29Merge branch 'ipv4-convert-rtm_-new-del-addr-and-more-to-per-netns-rtnl'Paolo Abeni
Kuniyuki Iwashima says: ==================== ipv4: Convert RTM_{NEW,DEL}ADDR and more to per-netns RTNL. The IPv4 address hash table and GC are already namespacified. This series converts RTM_NEWADDR/RTM_DELADDR and some more RTNL users to per-netns RTNL. Changes: v2: * Add patch 1 to address sparse warning for CONFIG_DEBUG_NET_SMALL_RTNL=n * Add Eric's tags to patch 2-12 v1: https://lore.kernel.org/netdev/20241018012225.90409-1-kuniyu@amazon.com/ ==================== Link: https://patch.msgid.link/20241021183239.79741-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29ipv4: Convert devinet_ioctl to per-netns RTNL.Kuniyuki Iwashima
ioctl(SIOCGIFCONF) calls dev_ifconf() that operates on the current netns. Let's use per-netns RTNL helpers in dev_ifconf() and inet_gifconf(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29ipv4: Convert devinet_ioctl() to per-netns RTNL except for SIOCSIFFLAGS.Kuniyuki Iwashima
Basically, devinet_ioctl() operates on a single netns. However, ioctl(SIOCSIFFLAGS) will trigger the netdev notifier that could touch another netdev in different netns. Let's use per-netns RTNL helper in devinet_ioctl() and place ASSERT_RTNL() for SIOCSIFFLAGS. We will remove ASSERT_RTNL() once RTM_SETLINK and RTM_DELLINK are converted. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29ipv4: Convert devinet_sysctl_forward() to per-netns RTNL.Kuniyuki Iwashima
devinet_sysctl_forward() touches only a single netns. Let's use rtnl_trylock() and __in_dev_get_rtnl_net(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29rtnetlink: Define rtnl_net_trylock().Kuniyuki Iwashima
We will need the per-netns version of rtnl_trylock(). rtnl_net_trylock() calls __rtnl_net_lock() only when rtnl_trylock() successfully holds RTNL. When RTNL is removed, we will use mutex_trylock() for per-netns RTNL. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29ipv4: Convert check_lifetime() to per-netns RTNL.Kuniyuki Iwashima
Since commit 1675f385213e ("ipv4: Namespacify IPv4 address GC."), check_lifetime() works on a per-netns basis. Let's use rtnl_net_lock() and rtnl_net_dereference(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29ipv4: Convert RTM_DELADDR to per-netns RTNL.Kuniyuki Iwashima
Let's push down RTNL into inet_rtm_deladdr() as rtnl_net_lock(). Now, ip_mc_autojoin_config() is always called under per-netns RTNL, so ASSERT_RTNL() can be replaced with ASSERT_RTNL_NET(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29ipv4: Use per-netns RTNL helpers in inet_rtm_newaddr().Kuniyuki Iwashima
inet_rtm_to_ifa() and find_matching_ifa() are called under rtnl_net_lock(). __in_dev_get_rtnl() and in_dev_for_each_ifa_rtnl() there can use per-netns RTNL helpers. Let's define and use __in_dev_get_rtnl_net() and in_dev_for_each_ifa_rtnl_net(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29ipv4: Convert RTM_NEWADDR to per-netns RTNL.Kuniyuki Iwashima
The address hash table and GC are already namespacified. Let's push down RTNL into inet_rtm_newaddr() as rtnl_net_lock(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29ipv4: Don't allocate ifa for 0.0.0.0 in inet_rtm_newaddr().Kuniyuki Iwashima
When we pass 0.0.0.0 to __inet_insert_ifa(), it frees ifa and returns 0. We can do this check much earlier for RTM_NEWADDR even before allocating struct in_ifaddr. Let's move the validation to 1. inet_insert_ifa() for ioctl() 2. inet_rtm_newaddr() for RTM_NEWADDR Now, we can remove the same check in find_matching_ifa(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29ipv4: Factorise RTM_NEWADDR validation to inet_validate_rtm().Kuniyuki Iwashima
rtm_to_ifaddr() validates some attributes, looks up a netdev, allocates struct in_ifaddr, and validates IFA_CACHEINFO. There is no reason to delay IFA_CACHEINFO validation. We will push RTNL down to inet_rtm_newaddr(), and then we want to complete rtnetlink validation before rtnl_net_lock(). Let's factorise the validation parts. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29rtnetlink: Define RTNL_FLAG_DOIT_PERNET for per-netns RTNL doit().Kuniyuki Iwashima
We will push RTNL down to each doit() as rtnl_net_lock(). We can use RTNL_FLAG_DOIT_UNLOCKED to call doit() without RTNL, but doit() will still hold RTNL. Let's define RTNL_FLAG_DOIT_PERNET as an alias of RTNL_FLAG_DOIT_UNLOCKED. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29rtnetlink: Make per-netns RTNL dereference helpers to macro.Kuniyuki Iwashima
When CONFIG_DEBUG_NET_SMALL_RTNL is off, rtnl_net_dereference() is the static inline wrapper of rtnl_dereference() returning a plain (void *) pointer to make sure net is always evaluated as requested in [0]. But, it makes sparse complain [1] when the pointer has __rcu annotation: net/ipv4/devinet.c:674:47: sparse: warning: incorrect type in argument 2 (different address spaces) net/ipv4/devinet.c:674:47: sparse: expected void *p net/ipv4/devinet.c:674:47: sparse: got struct in_ifaddr [noderef] __rcu * Also, if we evaluate net as (void *) in a macro, then the compiler in turn fails to build due to -Werror=unused-value. #define rtnl_net_dereference(net, p) \ ({ \ (void *)net; \ rtnl_dereference(p); \ }) net/ipv4/devinet.c: In function ‘inet_rtm_deladdr’: ./include/linux/rtnetlink.h:154:17: error: statement with no effect [-Werror=unused-value] 154 | (void *)net; \ net/ipv4/devinet.c:674:21: note: in expansion of macro ‘rtnl_net_dereference’ 674 | (ifa = rtnl_net_dereference(net, *ifap)) != NULL; | ^~~~~~~~~~~~~~~~~~~~ Let's go back to the original simplest macro. Note that checkpatch complains about this approach, but it's one-shot and less noisy than the other two. WARNING: Argument 'net' is not used in function-like macro #76: FILE: include/linux/rtnetlink.h:142: +#define rtnl_net_dereference(net, p) \ + rtnl_dereference(p) Fixes: 844e5e7e656d ("rtnetlink: Add assertion helpers for per-netns RTNL.") Link: https://lore.kernel.org/netdev/20241004132145.7fd208e9@kernel.org/ [0] Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202410200325.SaEJmyZS-lkp@intel.com/ [1] Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29net: stmmac: TSO: Fix unbalanced DMA map/unmap for non-paged SKB dataFurong Xu
In case the non-paged data of a SKB carries protocol header and protocol payload to be transmitted on a certain platform that the DMA AXI address width is configured to 40-bit/48-bit, or the size of the non-paged data is bigger than TSO_MAX_BUFF_SIZE on a certain platform that the DMA AXI address width is configured to 32-bit, then this SKB requires at least two DMA transmit descriptors to serve it. For example, three descriptors are allocated to split one DMA buffer mapped from one piece of non-paged data: dma_desc[N + 0], dma_desc[N + 1], dma_desc[N + 2]. Then three elements of tx_q->tx_skbuff_dma[] will be allocated to hold extra information to be reused in stmmac_tx_clean(): tx_q->tx_skbuff_dma[N + 0], tx_q->tx_skbuff_dma[N + 1], tx_q->tx_skbuff_dma[N + 2]. Now we focus on tx_q->tx_skbuff_dma[entry].buf, which is the DMA buffer address returned by DMA mapping call. stmmac_tx_clean() will try to unmap the DMA buffer _ONLY_IF_ tx_q->tx_skbuff_dma[entry].buf is a valid buffer address. The expected behavior that saves DMA buffer address of this non-paged data to tx_q->tx_skbuff_dma[entry].buf is: tx_q->tx_skbuff_dma[N + 0].buf = NULL; tx_q->tx_skbuff_dma[N + 1].buf = NULL; tx_q->tx_skbuff_dma[N + 2].buf = dma_map_single(); Unfortunately, the current code misbehaves like this: tx_q->tx_skbuff_dma[N + 0].buf = dma_map_single(); tx_q->tx_skbuff_dma[N + 1].buf = NULL; tx_q->tx_skbuff_dma[N + 2].buf = NULL; On the stmmac_tx_clean() side, when dma_desc[N + 0] is closed by the DMA engine, tx_q->tx_skbuff_dma[N + 0].buf is a valid buffer address obviously, then the DMA buffer will be unmapped immediately. There may be a rare case that the DMA engine does not finish the pending dma_desc[N + 1], dma_desc[N + 2] yet. Now things will go horribly wrong, DMA is going to access a unmapped/unreferenced memory region, corrupted data will be transmited or iommu fault will be triggered :( In contrast, the for-loop that maps SKB fragments behaves perfectly as expected, and that is how the driver should do for both non-paged data and paged frags actually. This patch corrects DMA map/unmap sequences by fixing the array index for tx_q->tx_skbuff_dma[entry].buf when assigning DMA buffer address. Tested and verified on DWXGMAC CORE 3.20a Reported-by: Suraj Jaiswal <quic_jsuraj@quicinc.com> Fixes: f748be531d70 ("stmmac: support new GMAC4") Signed-off-by: Furong Xu <0x1207@gmail.com> Reviewed-by: Hariprasad Kelam <hkelam@marvell.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241021061023.2162701-1-0x1207@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29net: stmmac: dwmac4: Fix high address display by updating reg_space[] from ↵Ley Foon Tan
register values The high address will display as 0 if the driver does not set the reg_space[]. To fix this, read the high address registers and update the reg_space[] accordingly. Fixes: fbf68229ffe7 ("net: stmmac: unify registers dumps methods") Signed-off-by: Ley Foon Tan <leyfoon.tan@starfivetech.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241021054625.1791965-1-leyfoon.tan@starfivetech.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-29mm: krealloc: Fix MTE false alarm in __do_kreallocQun-Wei Lin
This patch addresses an issue introduced by commit 1a83a716ec233 ("mm: krealloc: consider spare memory for __GFP_ZERO") which causes MTE (Memory Tagging Extension) to falsely report a slab-out-of-bounds error. The problem occurs when zeroing out spare memory in __do_krealloc. The original code only considered software-based KASAN and did not account for MTE. It does not reset the KASAN tag before calling memset, leading to a mismatch between the pointer tag and the memory tag, resulting in a false positive. Example of the error: ================================================================== swapper/0: BUG: KASAN: slab-out-of-bounds in __memset+0x84/0x188 swapper/0: Write at addr f4ffff8005f0fdf0 by task swapper/0/1 swapper/0: Pointer tag: [f4], memory tag: [fe] swapper/0: swapper/0: CPU: 4 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.12. swapper/0: Hardware name: MT6991(ENG) (DT) swapper/0: Call trace: swapper/0: dump_backtrace+0xfc/0x17c swapper/0: show_stack+0x18/0x28 swapper/0: dump_stack_lvl+0x40/0xa0 swapper/0: print_report+0x1b8/0x71c swapper/0: kasan_report+0xec/0x14c swapper/0: __do_kernel_fault+0x60/0x29c swapper/0: do_bad_area+0x30/0xdc swapper/0: do_tag_check_fault+0x20/0x34 swapper/0: do_mem_abort+0x58/0x104 swapper/0: el1_abort+0x3c/0x5c swapper/0: el1h_64_sync_handler+0x80/0xcc swapper/0: el1h_64_sync+0x68/0x6c swapper/0: __memset+0x84/0x188 swapper/0: btf_populate_kfunc_set+0x280/0x3d8 swapper/0: __register_btf_kfunc_id_set+0x43c/0x468 swapper/0: register_btf_kfunc_id_set+0x48/0x60 swapper/0: register_nf_nat_bpf+0x1c/0x40 swapper/0: nf_nat_init+0xc0/0x128 swapper/0: do_one_initcall+0x184/0x464 swapper/0: do_initcall_level+0xdc/0x1b0 swapper/0: do_initcalls+0x70/0xc0 swapper/0: do_basic_setup+0x1c/0x28 swapper/0: kernel_init_freeable+0x144/0x1b8 swapper/0: kernel_init+0x20/0x1a8 swapper/0: ret_from_fork+0x10/0x20 ================================================================== Fixes: 1a83a716ec233 ("mm: krealloc: consider spare memory for __GFP_ZERO") Signed-off-by: Qun-Wei Lin <qun-wei.lin@mediatek.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-10-29ALSA: hda/realtek: Add subwoofer quirk for Infinix ZERO BOOK 13Piyush Raj Chouhan
Infinix ZERO BOOK 13 has a 2+2 speaker system which isn't probed correctly. This patch adds a quirk with the proper pin connections. Also The mic in this laptop suffers too high gain resulting in mostly fan noise being recorded, This patch Also limit mic boost. HW Probe for device; https://linux-hardware.org/?probe=a2e892c47b Test: All 4 speaker works, Mic has low noise. Signed-off-by: Piyush Raj Chouhan <piyushchouhan1598@gmail.com> Link: https://patch.msgid.link/20241028155516.15552-1-piyuschouhan1598@gmail.com Signed-off-by: Takashi Iwai <tiwai@suse.de>
2024-10-28mm: avoid unconditional one-tick sleep when swapcache_prepare failsBarry Song
Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache") introduced an unconditional one-tick sleep when `swapcache_prepare()` fails, which has led to reports of UI stuttering on latency-sensitive Android devices. To address this, we can use a waitqueue to wake up tasks that fail `swapcache_prepare()` sooner, instead of always sleeping for a full tick. While tasks may occasionally be woken by an unrelated `do_swap_page()`, this method is preferable to two scenarios: rapid re-entry into page faults, which can cause livelocks, and multiple millisecond sleeps, which visibly degrade user experience. Oven's testing shows that a single waitqueue resolves the UI stuttering issue. If a 'thundering herd' problem becomes apparent later, a waitqueue hash similar to `folio_wait_table[PAGE_WAIT_TABLE_SIZE]` for page bit locks can be introduced. [v-songbaohua@oppo.com: wake_up only when swapcache_wq waitqueue is active] Link: https://lkml.kernel.org/r/20241008130807.40833-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240926211936.75373-1-21cnbao@gmail.com Fixes: 13ddaf26be32 ("mm/swap: fix race when skipping swapcache") Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reported-by: Oven Liyang <liyangouwen1@oppo.com> Tested-by: Oven Liyang <liyangouwen1@oppo.com> Cc: Kairui Song <kasong@tencent.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Yu Zhao <yuzhao@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: SeongJae Park <sj@kernel.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28mseal: update mseal.rstJeff Xu
Pedro Falcato's optimization [1] for checking sealed VMAs, which replaces the can_modify_mm() function with an in-loop check, necessitates an update to the mseal.rst documentation to reflect this change. Furthermore, the document has received offline comments regarding the code sample and suggestions for sentence clarification to enhance reader comprehension. [1] https://lore.kernel.org/linux-mm/20240817-mseal-depessimize-v3-0-d8d2e037df30@gmail.com/ Update doc after in-loop change: mprotect/madvise can have partially updated and munmap is atomic. Fix indentation and clarify some sections to improve readability. Link: https://lkml.kernel.org/r/20241008040942.1478931-2-jeffxu@chromium.org Fixes: df2a7df9a9aa ("mm/munmap: replace can_modify_mm with can_modify_vma") Fixes: 4a2dd02b0916 ("mm/mprotect: replace can_modify_mm with can_modify_vma") Fixes: 38075679b5f1 ("mm/mremap: replace can_modify_mm with can_modify_vma") Fixes: 23c57d1fa2b9 ("mseal: replace can_modify_mm_madv with a vma variant") Signed-off-by: Jeff Xu <jeffxu@chromium.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Cc: Elliott Hughes <enh@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jann Horn <jannh@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Kees Cook <keescook@chromium.org> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Stephen Röttger <sroettger@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: "Theo de Raadt" <deraadt@openbsd.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28mm: split critical region in remap_file_pages() and invoke LSMs in betweenKirill A. Shutemov
Commit ea7e2d5e49c0 ("mm: call the security_mmap_file() LSM hook in remap_file_pages()") fixed a security issue, it added an LSM check when trying to remap file pages, so that LSMs have the opportunity to evaluate such action like for other memory operations such as mmap() and mprotect(). However, that commit called security_mmap_file() inside the mmap_lock lock, while the other calls do it before taking the lock, after commit 8b3ec6814c83 ("take security_mmap_file() outside of ->mmap_sem"). This caused lock inversion issue with IMA which was taking the mmap_lock and i_mutex lock in the opposite way when the remap_file_pages() system call was called. Solve the issue by splitting the critical region in remap_file_pages() in two regions: the first takes a read lock of mmap_lock, retrieves the VMA and the file descriptor associated, and calculates the 'prot' and 'flags' variables; the second takes a write lock on mmap_lock, checks that the VMA flags and the VMA file descriptor are the same as the ones obtained in the first critical region (otherwise the system call fails), and calls do_mmap(). In between, after releasing the read lock and before taking the write lock, call security_mmap_file(), and solve the lock inversion issue. Link: https://lkml.kernel.org/r/20241018161415.3845146-1-roberto.sassu@huaweicloud.com Fixes: ea7e2d5e49c0 ("mm: call the security_mmap_file() LSM hook in remap_file_pages()") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com> Reported-by: syzbot+1cd571a672400ef3a930@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-security-module/66f7b10e.050a0220.46d20.0036.GAE@google.com/ Tested-by: Roberto Sassu <roberto.sassu@huawei.com> Reviewed-by: Roberto Sassu <roberto.sassu@huawei.com> Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Paul Moore <paul@paul-moore.com> Tested-by: syzbot+1cd571a672400ef3a930@syzkaller.appspotmail.com Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Dmitry Kasatkin <dmitry.kasatkin@gmail.com> Cc: Eric Snowberg <eric.snowberg@oracle.com> Cc: James Morris <jmorris@namei.org> Cc: Mimi Zohar <zohar@linux.ibm.com> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Shu Han <ebpqwerty472123@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28selftests/mm: fix deadlock for fork after pthread_create with atomic_boolEdward Liaw
Some additional synchronization is needed on Android ARM64; we see a deadlock with pthread_create when the parent thread races forward before the child has a chance to start doing work. Link: https://lkml.kernel.org/r/20241018171734.2315053-4-edliaw@google.com Fixes: cff294582798 ("selftests/mm: extend and rename uffd pagemap test") Signed-off-by: Edward Liaw <edliaw@google.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28Revert "selftests/mm: replace atomic_bool with pthread_barrier_t"Edward Liaw
This reverts commit e61ef21e27e8deed8c474e9f47f4aa7bc37e138c. uffd_poll_thread may be called by other tests that do not initialize the pthread_barrier, so this approach is not correct. This will revert to using atomic_bool instead. Link: https://lkml.kernel.org/r/20241018171734.2315053-3-edliaw@google.com Fixes: e61ef21e27e8 ("selftests/mm: replace atomic_bool with pthread_barrier_t") Signed-off-by: Edward Liaw <edliaw@google.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28Revert "selftests/mm: fix deadlock for fork after pthread_create on ARM"Edward Liaw
Patch series "selftests/mm: revert pthread_barrier change" On Android arm, pthread_create followed by a fork caused a deadlock in the case where the fork required work to be completed by the created thread. The previous patches incorrectly assumed that the parent would always initialize the pthread_barrier for the child thread. This reverts the change and replaces the fix for wp-fork-with-event with the original use of atomic_bool. This patch (of 3): This reverts commit e142cc87ac4ec618f2ccf5f68aedcd6e28a59d9d. fork_event_consumer may be called by other tests that do not initialize the pthread_barrier, so this approach is not correct. The subsequent patch will revert to using atomic_bool instead. Link: https://lkml.kernel.org/r/20241018171734.2315053-1-edliaw@google.com Link: https://lkml.kernel.org/r/20241018171734.2315053-2-edliaw@google.com Fixes: e142cc87ac4e ("fix deadlock for fork after pthread_create on ARM") Signed-off-by: Edward Liaw <edliaw@google.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28tools: testing: add expand-only mode VMA testLorenzo Stoakes
Add a test to assert that VMG_FLAG_JUST_EXPAND functions as expected - that is, when the VMA iterator is positioned at the previous VMA and no VMAs proceed it, we observe an expansion with all state as expected. Explicitly place a prior VMA that would otherwise fail this test if the mode were not enabled (as it would traverse to the previous-previous VMA). Link: https://lkml.kernel.org/r/d2f88330254a6448092412bf7dfe077a579ab0dc.1729174352.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28mm/vma: add expand-only VMA merge mode and optimise do_brk_flags()Lorenzo Stoakes
Patch series "introduce VMA merge mode to improve brk() performance". A ~5% performance regression was discovered on the aim9.brk_test.ops_per_sec by the linux kernel test bot [0]. In the past to satisfy brk() performance we duplicated VMA expansion code and special-cased do_brk_flags(). This is however horrid and undoes work to abstract this logic, so in resolving the issue I have endeavoured to avoid this. Investigating further I was able to observe that the use of a vma_iter_next_range() and vma_prev() pair, causing an unnecessary maple tree walk. In addition there is work that we do that is simply unnecessary for brk(). Therefore, add a special VMA merge mode VMG_FLAG_JUST_EXPAND to avoid doing any of this - it assumes the VMA iterator is pointing at the previous VMA and which skips logic that brk() does not require. This mostly eliminates the performance regression reducing it to ~2% which is in the realm of noise. In addition, the will-it-scale test brk2, written to be more representative of real-world brk() usage, shows a modest performance improvement - which gives me confidence that we are not meaningfully regressing real workloads here. This series includes a test asserting that the 'just expand' mode works as expected. With many thanks to Oliver Sang for helping with performance testing of candidate patch sets! [0]:https://lore.kernel.org/linux-mm/202409301043.629bea78-oliver.sang@intel.com This patch (of 2): We know in advance that do_brk_flags() wants only to perform a VMA expansion (if the prior VMA is compatible), and that we assume no mergeable VMA follows it. These are the semantics of this function prior to the recent rewrite of the VMA merging logic, however we are now doing more work than necessary - positioning the VMA iterator at the prior VMA and performing tasks that are not required. Add a new field to the vmg struct to permit merge flags and add a new merge flag VMG_FLAG_JUST_EXPAND which implies this behaviour, and have do_brk_flags() use this. This fixes a reported performance regression in a brk() benchmarking suite. Link: https://lkml.kernel.org/r/cover.1729174352.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/4e65d4395e5841c5acf8470dbcb714016364fd39.1729174352.git.lorenzo.stoakes@oracle.com Fixes: cacded5e42b9 ("mm: avoid using vma_merge() for new VMAs") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/linux-mm/202409301043.629bea78-oliver.sang@intel.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28resource,kexec: walk_system_ram_res_rev must retain resource flagsGregory Price
walk_system_ram_res_rev() erroneously discards resource flags when passing the information to the callback. This causes systems with IORESOURCE_SYSRAM_DRIVER_MANAGED memory to have these resources selected during kexec to store kexec buffers if that memory happens to be at placed above normal system ram. This leads to undefined behavior after reboot. If the kexec buffer is never touched, nothing happens. If the kexec buffer is touched, it could lead to a crash (like below) or undefined behavior. Tested on a system with CXL memory expanders with driver managed memory, TPM enabled, and CONFIG_IMA_KEXEC=y. Adding printk's showed the flags were being discarded and as a result the check for IORESOURCE_SYSRAM_DRIVER_MANAGED passes. find_next_iomem_res: name(System RAM (kmem)) start(10000000000) end(1034fffffff) flags(83000200) locate_mem_hole_top_down: start(10000000000) end(1034fffffff) flags(0) [.] BUG: unable to handle page fault for address: ffff89834ffff000 [.] #PF: supervisor read access in kernel mode [.] #PF: error_code(0x0000) - not-present page [.] PGD c04c8bf067 P4D c04c8bf067 PUD c04c8be067 PMD 0 [.] Oops: 0000 [#1] SMP [.] RIP: 0010:ima_restore_measurement_list+0x95/0x4b0 [.] RSP: 0018:ffffc900000d3a80 EFLAGS: 00010286 [.] RAX: 0000000000001000 RBX: 0000000000000000 RCX: ffff89834ffff000 [.] RDX: 0000000000000018 RSI: ffff89834ffff000 RDI: ffff89834ffff018 [.] RBP: ffffc900000d3ba0 R08: 0000000000000020 R09: ffff888132b8a900 [.] R10: 4000000000000000 R11: 000000003a616d69 R12: 0000000000000000 [.] R13: ffffffff8404ac28 R14: 0000000000000000 R15: ffff89834ffff000 [.] FS: 0000000000000000(0000) GS:ffff893d44640000(0000) knlGS:0000000000000000 [.] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [.] ata5: SATA link down (SStatus 0 SControl 300) [.] CR2: ffff89834ffff000 CR3: 000001034d00f001 CR4: 0000000000770ef0 [.] PKRU: 55555554 [.] Call Trace: [.] <TASK> [.] ? __die+0x78/0xc0 [.] ? page_fault_oops+0x2a8/0x3a0 [.] ? exc_page_fault+0x84/0x130 [.] ? asm_exc_page_fault+0x22/0x30 [.] ? ima_restore_measurement_list+0x95/0x4b0 [.] ? template_desc_init_fields+0x317/0x410 [.] ? crypto_alloc_tfm_node+0x9c/0xc0 [.] ? init_ima_lsm+0x30/0x30 [.] ima_load_kexec_buffer+0x72/0xa0 [.] ima_init+0x44/0xa0 [.] __initstub__kmod_ima__373_1201_init_ima7+0x1e/0xb0 [.] ? init_ima_lsm+0x30/0x30 [.] do_one_initcall+0xad/0x200 [.] ? idr_alloc_cyclic+0xaa/0x110 [.] ? new_slab+0x12c/0x420 [.] ? new_slab+0x12c/0x420 [.] ? number+0x12a/0x430 [.] ? sysvec_apic_timer_interrupt+0xa/0x80 [.] ? asm_sysvec_apic_timer_interrupt+0x16/0x20 [.] ? parse_args+0xd4/0x380 [.] ? parse_args+0x14b/0x380 [.] kernel_init_freeable+0x1c1/0x2b0 [.] ? rest_init+0xb0/0xb0 [.] kernel_init+0x16/0x1a0 [.] ret_from_fork+0x2f/0x40 [.] ? rest_init+0xb0/0xb0 [.] ret_from_fork_asm+0x11/0x20 [.] </TASK> Link: https://lore.kernel.org/all/20231114091658.228030-1-bhe@redhat.com/ Link: https://lkml.kernel.org/r/20241017190347.5578-1-gourry@gourry.net Fixes: 7acf164b259d ("resource: add walk_system_ram_res_rev()") Signed-off-by: Gregory Price <gourry@gourry.net> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: AKASHI Takahiro <takahiro.akashi@linaro.org> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Cc: Mika Westerberg <mika.westerberg@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28nilfs2: fix kernel bug due to missing clearing of checked flagRyusuke Konishi
Syzbot reported that in directory operations after nilfs2 detects filesystem corruption and degrades to read-only, __block_write_begin_int(), which is called to prepare block writes, may fail the BUG_ON check for accesses exceeding the folio/page size, triggering a kernel bug. This was found to be because the "checked" flag of a page/folio was not cleared when it was discarded by nilfs2's own routine, which causes the sanity check of directory entries to be skipped when the directory page/folio is reloaded. So, fix that. This was necessary when the use of nilfs2's own page discard routine was applied to more than just metadata files. Link: https://lkml.kernel.org/r/20241017193359.5051-1-konishi.ryusuke@gmail.com Fixes: 8c26c4e2694a ("nilfs2: fix issue with flush kernel thread after remount in RO mode because of driver's internal error or metadata corruption") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+d6ca2daf692c7a82f959@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=d6ca2daf692c7a82f959 Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28mm: numa_clear_kernel_node_hotplug: Add NUMA_NO_NODE check for node idNobuhiro Iwamatsu
The acquired memory blocks for reserved may include blocks outside of memory management. In this case, the nid variable is set to NUMA_NO_NODE (-1), so an error occurs in node_set(). This adds a check using numa_valid_node() to numa_clear_kernel_node_hotplug() that skips node_set() when nid is set to NUMA_NO_NODE. Link: https://lkml.kernel.org/r/1729070461-13576-1-git-send-email-nobuhiro1.iwamatsu@toshiba.co.jp Fixes: 87482708210f ("mm: introduce numa_memblks") Signed-off-by: Nobuhiro Iwamatsu <nobuhiro1.iwamatsu@toshiba.co.jp> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Suggested-by: Yuji Ishikawa <yuji2.ishikawa@toshiba.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28ocfs2: pass u64 to ocfs2_truncate_inline maybe overflowEdward Adam Davis
Syzbot reported a kernel BUG in ocfs2_truncate_inline. There are two reasons for this: first, the parameter value passed is greater than ocfs2_max_inline_data_with_xattr, second, the start and end parameters of ocfs2_truncate_inline are "unsigned int". So, we need to add a sanity check for byte_start and byte_len right before ocfs2_truncate_inline() in ocfs2_remove_inode_range(), if they are greater than ocfs2_max_inline_data_with_xattr return -EINVAL. Link: https://lkml.kernel.org/r/tencent_D48DB5122ADDAEDDD11918CFB68D93258C07@qq.com Fixes: 1afc32b95233 ("ocfs2: Write support for inline data") Signed-off-by: Edward Adam Davis <eadavis@qq.com> Reported-by: syzbot+81092778aac03460d6b7@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=81092778aac03460d6b7 Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28mm: shmem: fix data-race in shmem_getattr()Jeongjun Park
I got the following KCSAN report during syzbot testing: ================================================================== BUG: KCSAN: data-race in generic_fillattr / inode_set_ctime_current write to 0xffff888102eb3260 of 4 bytes by task 6565 on cpu 1: inode_set_ctime_to_ts include/linux/fs.h:1638 [inline] inode_set_ctime_current+0x169/0x1d0 fs/inode.c:2626 shmem_mknod+0x117/0x180 mm/shmem.c:3443 shmem_create+0x34/0x40 mm/shmem.c:3497 lookup_open fs/namei.c:3578 [inline] open_last_lookups fs/namei.c:3647 [inline] path_openat+0xdbc/0x1f00 fs/namei.c:3883 do_filp_open+0xf7/0x200 fs/namei.c:3913 do_sys_openat2+0xab/0x120 fs/open.c:1416 do_sys_open fs/open.c:1431 [inline] __do_sys_openat fs/open.c:1447 [inline] __se_sys_openat fs/open.c:1442 [inline] __x64_sys_openat+0xf3/0x120 fs/open.c:1442 x64_sys_call+0x1025/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:258 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x54/0x120 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x76/0x7e read to 0xffff888102eb3260 of 4 bytes by task 3498 on cpu 0: inode_get_ctime_nsec include/linux/fs.h:1623 [inline] inode_get_ctime include/linux/fs.h:1629 [inline] generic_fillattr+0x1dd/0x2f0 fs/stat.c:62 shmem_getattr+0x17b/0x200 mm/shmem.c:1157 vfs_getattr_nosec fs/stat.c:166 [inline] vfs_getattr+0x19b/0x1e0 fs/stat.c:207 vfs_statx_path fs/stat.c:251 [inline] vfs_statx+0x134/0x2f0 fs/stat.c:315 vfs_fstatat+0xec/0x110 fs/stat.c:341 __do_sys_newfstatat fs/stat.c:505 [inline] __se_sys_newfstatat+0x58/0x260 fs/stat.c:499 __x64_sys_newfstatat+0x55/0x70 fs/stat.c:499 x64_sys_call+0x141f/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:263 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x54/0x120 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x76/0x7e value changed: 0x2755ae53 -> 0x27ee44d3 Reported by Kernel Concurrency Sanitizer on: CPU: 0 UID: 0 PID: 3498 Comm: udevd Not tainted 6.11.0-rc6-syzkaller-00326-gd1f2d51b711a-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/06/2024 ================================================================== When calling generic_fillattr(), if you don't hold read lock, data-race will occur in inode member variables, which can cause unexpected behavior. Since there is no special protection when shmem_getattr() calls generic_fillattr(), data-race occurs by functions such as shmem_unlink() or shmem_mknod(). This can cause unexpected results, so commenting it out is not enough. Therefore, when calling generic_fillattr() from shmem_getattr(), it is appropriate to protect the inode using inode_lock_shared() and inode_unlock_shared() to prevent data-race. Link: https://lkml.kernel.org/r/20240909123558.70229-1-aha310510@gmail.com Fixes: 44a30220bc0a ("shmem: recalculate file inode when fstat") Signed-off-by: Jeongjun Park <aha310510@gmail.com> Reported-by: syzbot <syzkaller@googlegroup.com> Cc: Hugh Dickins <hughd@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28mm: mark mas allocation in vms_abort_munmap_vmas as __GFP_NOFAILJann Horn
vms_abort_munmap_vmas() is a recovery path where, on entry, some VMAs have already been torn down halfway (in a way we can't undo) but are still present in the maple tree. At this point, we *must* remove the VMAs from the VMA tree, otherwise we get UAF. Because removing VMA tree nodes can require memory allocation, the existing code has an error path which tries to handle this by reattaching the VMAs; but that can't be done safely. A nicer way to fix it would probably be to preallocate enough maple tree nodes for the removal before the point of no return, or something like that; but for now, fix it the easy and kinda ugly way, by marking this allocation __GFP_NOFAIL. Link: https://lkml.kernel.org/r/20241016-fix-munmap-abort-v1-1-601c94b2240d@google.com Fixes: 4f87153e82c4 ("mm: change failure of MAP_FIXED to restoring the gap on failure") Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28x86/traps: move kmsan check after instrumentation_beginSabyrzhan Tasbolatov
During x86_64 kernel build with CONFIG_KMSAN, the objtool warns following: AR built-in.a AR vmlinux.a LD vmlinux.o vmlinux.o: warning: objtool: handle_bug+0x4: call to kmsan_unpoison_entry_regs() leaves .noinstr.text section OBJCOPY modules.builtin.modinfo GEN modules.builtin MODPOST Module.symvers CC .vmlinux.export.o Moving kmsan_unpoison_entry_regs() _after_ instrumentation_begin() fixes the warning. There is decode_bug(regs->ip, &imm) is left before KMSAN unpoisoining, but it has the return condition and if we include it after instrumentation_begin() it results the warning "return with instrumentation enabled", hence, I'm concerned that regs will not be KMSAN unpoisoned if `ud_type == BUG_NONE` is true. Link: https://lkml.kernel.org/r/20241016152407.3149001-1-snovitoll@gmail.com Fixes: ba54d194f8da ("x86/traps: avoid KMSAN bugs originating from handle_bug()") Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28resource: remove dependency on SPARSEMEM from GET_FREE_REGIONHuang Ying
We want to use the functions (get_free_mem_region()) configured via GET_FREE_REGION in resource kunit tests. However, GET_FREE_REGION depends on SPARSEMEM now. This makes resource kunit tests cannot be built on some architectures lacking SPARSEMEM, or causes config warning as follows, WARNING: unmet direct dependencies detected for GET_FREE_REGION Depends on [n]: SPARSEMEM [=n] Selected by [y]: - RESOURCE_KUNIT_TEST [=y] && RUNTIME_TESTING_MENU [=y] && KUNIT [=y] When get_free_mem_region() was introduced the only consumers were those looking to pass the address range to memremap_pages(). That address range needed to be mindful of the maximum addressable platform physical address which at the time only SPARSMEM defined via MAX_PHYSMEM_BITS. Given that memremap_pages() also depended on SPARSEMEM via ZONE_DEVICE, it was easier to just depend on that definition than invent a general MAX_PHYSMEM_BITS concept outside of SPARSEMEM. Turns out that decision was buggy and did not account for KASAN consumption of physical address space. That problem was resolved recently with commit ea72ce5da228 ("x86/kaslr: Expose and use the end of the physical memory address space"), and GET_FREE_REGION dropped its MAX_PHYSMEM_BITS dependency. Then commit 99185c10d5d9 ("resource, kunit: add test case for region_intersects()"), went ahead and fixed up the only remaining dependency on SPARSEMEM which was usage of the PA_SECTION_SHIFT macro for setting the default alignment. A PAGE_SIZE fallback is fine in the SPARSEMEM=n case. With those build dependencies gone GET_FREE_REGION no longer depends on SPARSEMEM. So, the patch removes dependency on SPARSEMEM from GET_FREE_REGION to fix the build issues. Link: https://lkml.kernel.org/r/20241016014730.339369-1-ying.huang@intel.com Link: https://lore.kernel.org/lkml/20240922225041.603186-1-linux@roeck-us.net/ Link: https://lkml.kernel.org/r/20241015051554.294734-1-ying.huang@intel.com Fixes: 99185c10d5d9 ("resource, kunit: add test case for region_intersects()") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Co-developed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Nathan Chancellor <nathan@kernel.org> # build Cc: Arnd Bergmann <arnd@arndb.de> Cc: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28mm/mmap: fix race in mmap_region() with ftruncate()Liam R. Howlett
Avoiding the zeroing of the vma tree in mmap_region() introduced a race with truncate in the page table walk. To avoid any races, create a hole in the rmap during the operation by clearing the pagetable entries earlier under the mmap write lock and (critically) before the new vma is installed into the vma tree. The result is that the old vma(s) are left in the vma tree, but free_pgtables() removes them from the rmap and clears the ptes while holding the necessary locks. This change extends the fix required for hugetblfs and the call_mmap() function by moving the cleanup higher in the function and running it unconditionally. Link: https://lkml.kernel.org/r/20241016013455.2241533-1-Liam.Howlett@oracle.com Fixes: f8d112a4e657 ("mm/mmap: avoid zeroing vma tree in mmap_region()") Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reported-by: Jann Horn <jannh@google.com> Closes: https://lore.kernel.org/all/CAG48ez0ZpGzxi=-5O_uGQ0xKXOmbjeQ0LjZsRJ1Qtf2X5eOr1w@mail.gmail.com/ Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28mm/page_alloc: let GFP_ATOMIC order-0 allocs access highatomic reservesMatt Fleming
Under memory pressure it's possible for GFP_ATOMIC order-0 allocations to fail even though free pages are available in the highatomic reserves. GFP_ATOMIC allocations cannot trigger unreserve_highatomic_pageblock() since it's only run from reclaim. Given that such allocations will pass the watermarks in __zone_watermark_unusable_free(), it makes sense to fallback to highatomic reserves the same way that ALLOC_OOM can. This fixes order-0 page allocation failures observed on Cloudflare's fleet when handling network packets: kswapd1: page allocation failure: order:0, mode:0x820(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0-7 CPU: 10 PID: 696 Comm: kswapd1 Kdump: loaded Tainted: G O 6.6.43-CUSTOM #1 Hardware name: MACHINE Call Trace: <IRQ> dump_stack_lvl+0x3c/0x50 warn_alloc+0x13a/0x1c0 __alloc_pages_slowpath.constprop.0+0xc9d/0xd10 __alloc_pages+0x327/0x340 __napi_alloc_skb+0x16d/0x1f0 bnxt_rx_page_skb+0x96/0x1b0 [bnxt_en] bnxt_rx_pkt+0x201/0x15e0 [bnxt_en] __bnxt_poll_work+0x156/0x2b0 [bnxt_en] bnxt_poll+0xd9/0x1c0 [bnxt_en] __napi_poll+0x2b/0x1b0 bpf_trampoline_6442524138+0x7d/0x1000 __napi_poll+0x5/0x1b0 net_rx_action+0x342/0x740 handle_softirqs+0xcf/0x2b0 irq_exit_rcu+0x6c/0x90 sysvec_apic_timer_interrupt+0x72/0x90 </IRQ> [mfleming@cloudflare.com: update comment] Link: https://lkml.kernel.org/r/20241015125158.3597702-1-matt@readmodwrite.com Link: https://lkml.kernel.org/r/20241011120737.3300370-1-matt@readmodwrite.com Link: https://lore.kernel.org/all/CAGis_TWzSu=P7QJmjD58WWiu3zjMTVKSzdOwWE8ORaGytzWJwQ@mail.gmail.com/ Fixes: 1d91df85f399 ("mm/page_alloc: handle a missing case for memalloc_nocma_{save/restore} APIs") Signed-off-by: Matt Fleming <mfleming@cloudflare.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28fork: only invoke khugepaged, ksm hooks if no errorLorenzo Stoakes
There is no reason to invoke these hooks early against an mm that is in an incomplete state. The change in commit d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") makes this more pertinent as we may be in a state where entries in the maple tree are not yet consistent. Their placement early in dup_mmap() only appears to have been meaningful for early error checking, and since functionally it'd require a very small allocation to fail (in practice 'too small to fail') that'd only occur in the most dire circumstances, meaning the fork would fail or be OOM'd in any case. Since both khugepaged and KSM tracking are there to provide optimisations to memory performance rather than critical functionality, it doesn't really matter all that much if, under such dire memory pressure, we fail to register an mm with these. As a result, we follow the example of commit d2081b2bf819 ("mm: khugepaged: make khugepaged_enter() void function") and make ksm_fork() a void function also. We only expose the mm to these functions once we are done with them and only if no error occurred in the fork operation. Link: https://lkml.kernel.org/r/e0cb8b840c9d1d5a6e84d4f8eff5f3f2022aa10c.1729014377.git.lorenzo.stoakes@oracle.com Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jann Horn <jannh@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Linus Torvalds <torvalds@linuxfoundation.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28fork: do not invoke uffd on fork if error occursLorenzo Stoakes
Patch series "fork: do not expose incomplete mm on fork". During fork we may place the virtual memory address space into an inconsistent state before the fork operation is complete. In addition, we may encounter an error during the fork operation that indicates that the virtual memory address space is invalidated. As a result, we should not be exposing it in any way to external machinery that might interact with the mm or VMAs, machinery that is not designed to deal with incomplete state. We specifically update the fork logic to defer khugepaged and ksm to the end of the operation and only to be invoked if no error arose, and disallow uffd from observing fork events should an error have occurred. This patch (of 2): Currently on fork we expose the virtual address space of a process to userland unconditionally if uffd is registered in VMAs, regardless of whether an error arose in the fork. This is performed in dup_userfaultfd_complete() which is invoked unconditionally, and performs two duties - invoking registered handlers for the UFFD_EVENT_FORK event via dup_fctx(), and clearing down userfaultfd_fork_ctx objects established in dup_userfaultfd(). This is problematic, because the virtual address space may not yet be correctly initialised if an error arose. The change in commit d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") makes this more pertinent as we may be in a state where entries in the maple tree are not yet consistent. We address this by, on fork error, ensuring that we roll back state that we would otherwise expect to clean up through the event being handled by userland and perform the memory freeing duty otherwise performed by dup_userfaultfd_complete(). We do this by implementing a new function, dup_userfaultfd_fail(), which performs the same loop, only decrementing reference counts. Note that we perform mmgrab() on the parent and child mm's, however userfaultfd_ctx_put() will mmdrop() this once the reference count drops to zero, so we will avoid memory leaks correctly here. Link: https://lkml.kernel.org/r/cover.1729014377.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/d3691d58bb58712b6fb3df2be441d175bd3cdf07.1729014377.git.lorenzo.stoakes@oracle.com Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Linus Torvalds <torvalds@linuxfoundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28mm/pagewalk: fix usage of pmd_leaf()/pud_leaf() without present checkDavid Hildenbrand
pmd_leaf()/pud_leaf() only implies a pmd_present()/pud_present() check on some architectures. We really should check for pmd_present()/pud_present() first. This should explain the report we got on ppc64 (which has CONFIG_PGTABLE_HAS_HUGE_LEAVES set in the config) that triggered: VM_WARN_ON_ONCE(pmd_leaf(pmdp_get_lockless(pmdp))); Likely we had a PMD migration entry for which pmd_leaf() did not trigger. We raced with restoring the PMD migration entry, and suddenly saw a pmd_leaf(). In this case, pte_offset_map_lock() saved us from more trouble, because it rechecks the PMD value, but we would not have processed the migration entry -- which is not too bad because the only user of FW_MIGRATION is KSM for unsharing, and KSM only applies to small folios. Further, we shouldn't re-read the PMD/PUD value for our warning, the primary purpose of the VM_WARN_ON_ONCE() is to find spurious use of pmd_leaf()/pud_leaf() without CONFIG_PGTABLE_HAS_HUGE_LEAVES. As a side note, we are currently not implementing FW_MIGRATION support for PUD migration entries, which likely should exist due to hugetlb. Add a TODO so this won't fall through the cracks if more FW_MIGRATION users get added. Was able to write a quick reproducer and verify that the issue no longer triggers with this fix. https://gitlab.com/davidhildenbrand/scratchspace/-/blob/main/reproducers/move-pages-pmd-leaf.c Without this fix after a couple of seconds in a VM with 2 NUMA nodes: [ 54.333753] ------------[ cut here ]------------ [ 54.334901] WARNING: CPU: 20 PID: 1704 at mm/pagewalk.c:815 folio_walk_start+0x48f/0x6e0 [ 54.336455] Modules linked in: ... [ 54.345009] CPU: 20 UID: 0 PID: 1704 Comm: move-pages-pmd- Not tainted 6.12.0-rc2+ #81 [ 54.346529] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014 [ 54.348191] RIP: 0010:folio_walk_start+0x48f/0x6e0 [ 54.349134] Code: b5 ad 48 8d 35 00 00 00 00 e8 6d 59 d7 ff e8 08 74 da ff e9 9c fe ff ff 4c 8b 7c 24 08 4c 89 ff e8 26 2b be 00 e9 8a fe ff ff <0f> 0b e9 ec fe ff ff f7 c2 ff 0f 00 00 0f 85 81 fe ff ff 48 8b 02 [ 54.352660] RSP: 0018:ffffb7e4c430bc78 EFLAGS: 00010282 [ 54.353679] RAX: 80000002a3e008e7 RBX: ffff9946039aa580 RCX: ffff994380000000 [ 54.355056] RDX: ffff994606aec000 RSI: 00007f004b000000 RDI: 0000000000000000 [ 54.356440] RBP: 00007f004b000000 R08: 0000000000000591 R09: 0000000000000001 [ 54.357820] R10: 0000000000000200 R11: 0000000000000001 R12: ffffb7e4c430bd10 [ 54.359198] R13: ffff994606aec2c0 R14: 0000000000000002 R15: ffff994604a89b00 [ 54.360564] FS: 00007f004ae006c0(0000) GS:ffff9947f7400000(0000) knlGS:0000000000000000 [ 54.362111] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 54.363242] CR2: 00007f004adffe58 CR3: 0000000281e12005 CR4: 0000000000770ef0 [ 54.364615] PKRU: 55555554 [ 54.365153] Call Trace: [ 54.365646] <TASK> [ 54.366073] ? __warn.cold+0xb7/0x14d [ 54.366796] ? folio_walk_start+0x48f/0x6e0 [ 54.367628] ? report_bug+0xff/0x140 [ 54.368324] ? handle_bug+0x58/0x90 [ 54.369019] ? exc_invalid_op+0x17/0x70 [ 54.369771] ? asm_exc_invalid_op+0x1a/0x20 [ 54.370606] ? folio_walk_start+0x48f/0x6e0 [ 54.371415] ? folio_walk_start+0x9e/0x6e0 [ 54.372227] do_pages_move+0x1c5/0x680 [ 54.372972] kernel_move_pages+0x1a1/0x2b0 [ 54.373804] __x64_sys_move_pages+0x25/0x30 Link: https://lkml.kernel.org/r/20241015111236.1290921-1-david@redhat.com Fixes: aa39ca6940f1 ("mm/pagewalk: introduce folio_walk_start() + folio_walk_end()") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: syzbot+7d917f67c05066cec295@syzkaller.appspotmail.com Closes: https://lkml.kernel.org/r/670d3248.050a0220.3e960.0064.GAE@google.com Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28sock_map: fix a NULL pointer dereference in sock_map_link_update_prog()Cong Wang
The following race condition could trigger a NULL pointer dereference: sock_map_link_detach(): sock_map_link_update_prog(): mutex_lock(&sockmap_mutex); ... sockmap_link->map = NULL; mutex_unlock(&sockmap_mutex); mutex_lock(&sockmap_mutex); ... sock_map_prog_link_lookup(sockmap_link->map); mutex_unlock(&sockmap_mutex); <continue> Fix it by adding a NULL pointer check. In this specific case, it makes no sense to update a link which is being released. Reported-by: Ruan Bonan <bonan.ruan@u.nus.edu> Fixes: 699c23f02c65 ("bpf: Add bpf_link support for sk_msg and sk_skb progs") Cc: Yonghong Song <yonghong.song@linux.dev> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Link: https://lore.kernel.org/r/20241026185522.338562-1-xiyou.wangcong@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-10-28neighbour: use kvzalloc()/kvfree()Eric Dumazet
mm layer is providing convenient functions, we do not have to work around old limitations. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Gilad Naaman <gnaaman@drivenets.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20241022150059.1345406-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-28netlink: specs: Add missing phy-ntf command to ethtool specKory Maincent
ETHTOOL_MSG_PHY_NTF description is missing in the ethtool netlink spec. Add it to the spec. Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20241022151418.875424-1-kory.maincent@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-28vsock: do not leave dangling sk pointer in vsock_create()Eric Dumazet
syzbot was able to trigger the following warning after recent core network cleanup. On error vsock_create() frees the allocated sk object, but sock_init_data() has already attached it to the provided sock object. We must clear sock->sk to avoid possible use-after-free later. WARNING: CPU: 0 PID: 5282 at net/socket.c:1581 __sock_create+0x897/0x950 net/socket.c:1581 Modules linked in: CPU: 0 UID: 0 PID: 5282 Comm: syz.2.43 Not tainted 6.12.0-rc2-syzkaller-00667-g53bac8330865 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 RIP: 0010:__sock_create+0x897/0x950 net/socket.c:1581 Code: 7f 06 01 65 48 8b 34 25 00 d8 03 00 48 81 c6 b0 08 00 00 48 c7 c7 60 0b 0d 8d e8 d4 9a 3c 02 e9 11 f8 ff ff e8 0a ab 0d f8 90 <0f> 0b 90 e9 82 fd ff ff 89 e9 80 e1 07 fe c1 38 c1 0f 8c c7 f8 ff RSP: 0018:ffffc9000394fda8 EFLAGS: 00010293 RAX: ffffffff89873c46 RBX: ffff888079f3c818 RCX: ffff8880314b9e00 RDX: 0000000000000000 RSI: 00000000ffffffed RDI: 0000000000000000 RBP: ffffffff8d3337f0 R08: ffffffff8987384e R09: ffffffff8989473a R10: dffffc0000000000 R11: fffffbfff203a276 R12: 00000000ffffffed R13: ffff888079f3c8c0 R14: ffffffff898736e7 R15: dffffc0000000000 FS: 00005555680ab500(0000) GS:ffff8880b8600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f22b11196d0 CR3: 00000000308c0000 CR4: 00000000003526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> sock_create net/socket.c:1632 [inline] __sys_socket_create net/socket.c:1669 [inline] __sys_socket+0x150/0x3c0 net/socket.c:1716 __do_sys_socket net/socket.c:1730 [inline] __se_sys_socket net/socket.c:1728 [inline] __x64_sys_socket+0x7a/0x90 net/socket.c:1728 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f22b117dff9 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fff56aec0e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000029 RAX: ffffffffffffffda RBX: 00007f22b1335f80 RCX: 00007f22b117dff9 RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000028 RBP: 00007f22b11f0296 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007f22b1335f80 R14: 00007f22b1335f80 R15: 00000000000012dd Fixes: 48156296a08c ("net: warn, if pf->create does not clear sock->sk on error") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ignat Korchagin <ignat@cloudflare.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20241022134819.1085254-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-28net/mlx5: unique names for per device cachesSebastian Ott
Add the device name to the per device kmem_cache names to ensure their uniqueness. This fixes warnings like this: "kmem_cache of name 'mlx5_fs_fgs' already exists". Signed-off-by: Sebastian Ott <sebott@redhat.com> Reviewed-by: Breno Leitao <leitao@debian.org> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241023134146.28448-1-sebott@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>