2025-01-27  eth: forcedeth: remove local wrappers for napi enable/disable  (Jakub Kicinski)
The local helpers for calling napi_enable() and napi_disable() don't serve much purpose and they will complicate the fix in the subsequent patch. Remove them, call the core functions directly. Acked-by: Zhu Yanjun <zyjzyj2000@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250124031841.1179756-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-27  eth: tg3: fix calling napi_enable() in atomic context  (Jakub Kicinski)
tg3 has a spin lock protecting most of the config, switch to taking netdev_lock() explicitly on enable/start paths. Disable/stop paths seem to not be under the spin lock (since napi_disable() already needs to sleep), so leave that side as is. tg3_restart_hw() releases and re-takes the spin lock, we need to do the same because dev_close() needs to take netdev_lock(). Fixes: 413f0271f396 ("net: protect NAPI enablement with netdev_lock()") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/dcfd56bc-de32-4b11-9e19-d8bd1543745d@stanley.mountain Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250124031841.1179756-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-27  tools: ynl: c: correct reverse decode of empty attrs  (Jakub Kicinski)
netlink reports which attribute was incorrect by sending back an attribute offset. Offset points to the address of struct nlattr, but to interpret the type we also need the nesting path. Attribute IDs have different meaning in different nests of the same message. Correct the condition for "is the offset within current attribute". ynl_attr_data_len() does not include the attribute header, so the end offset was off by 4 bytes. This means that we'd always skip over flags and empty nests. The devmem tests, for example, issue an invalid request with empty queue nests, resulting in the following error:

  YNL failed: Kernel error: missing attribute: .queues.ifindex

The message is incorrect; the "queues" nest does not have an "ifindex" attribute defined. With this fix we descend correctly into the nest:

  YNL failed: Kernel error: missing attribute: .queues.id

Fixes: 86878f14d71a ("tools: ynl: user space helpers") Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250124012130.1121227-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
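A minimal sketch of the corrected bound check, assuming the usual netlink header size macro; variable names here are illustrative and not the actual ynl code:

    /* end of an attribute = start + header + payload; leaving out the
     * header (NLA_HDRLEN, 4 bytes) made zero-length attributes span an
     * empty range, so flags and empty nests were always skipped */
    unsigned int end = attr_off + NLA_HDRLEN + ynl_attr_data_len(attr);

    if (bad_attr_off >= attr_off && bad_attr_off < end) {
            /* the offending attribute lies within this nest: descend */
    }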
2025-01-27  ptp: Ensure info->enable callback is always set  (Thomas Weißschuh)
The ioctl and sysfs handlers unconditionally call the ->enable callback. Not all drivers implement that callback, leading to NULL dereferences. Example of affected drivers: ptp_s390.c, ptp_vclock.c and ptp_mock.c. Instead use a dummy callback if no better was specified by the driver. Fixes: d94ba80ebbea ("ptp: Added a brand new class driver for ptp clocks.") Cc: stable@vger.kernel.org Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Acked-by: Richard Cochran <richardcochran@gmail.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/20250123-ptp-enable-v1-1-b015834d3a47@weissschuh.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
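A minimal sketch of the idea, assuming the fallback is installed at clock registration time; the helper name is illustrative:

    /* fallback used when a driver leaves info->enable unset, so the
     * ioctl/sysfs paths can always call the callback safely */
    static int ptp_enable_fallback(struct ptp_clock_info *info,
                                   struct ptp_clock_request *request, int on)
    {
            return -EOPNOTSUPP;
    }

    /* e.g. during clock registration */
    if (!info->enable)
            info->enable = ptp_enable_fallback;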
2025-01-27  documentation: networking: fix spelling mistakes  (Khaled Elnaggar)
Fix a couple of typos/spelling mistakes in the documentation. Signed-off-by: Khaled Elnaggar <khaledelnaggarlinux@gmail.com> Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Link: https://patch.msgid.link/20250123082521.59997-1-khaledelnaggarlinux@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-27  net/mlx5e: add missing cpu_to_node to kvzalloc_node in mlx5e_open_xdpredirect_sq  (Stanislav Fomichev)
kvzalloc_node is not doing a runtime check on the node argument (__alloc_pages_node_noprof does have a VM_BUG_ON, but it expands to nothing on !CONFIG_DEBUG_VM builds), so doing any ethtool/netlink operation that calls mlx5e_open on a CPU that's larger than MAX_NUMNODES triggers OOB access and panic (see the trace below). Add the missing cpu_to_node call to convert the cpu id to a node id.

[ 165.427394] mlx5_core 0000:5c:00.0 beth1: Link up
[ 166.479327] BUG: unable to handle page fault for address: 0000000800000010
[ 166.494592] #PF: supervisor read access in kernel mode
[ 166.505995] #PF: error_code(0x0000) - not-present page
...
[ 166.816958] Call Trace:
[ 166.822380] <TASK>
[ 166.827034] ? __die_body+0x64/0xb0
[ 166.834774] ? page_fault_oops+0x2cd/0x3f0
[ 166.843862] ? exc_page_fault+0x63/0x130
[ 166.852564] ? asm_exc_page_fault+0x22/0x30
[ 166.861843] ? __kvmalloc_node_noprof+0x43/0xd0
[ 166.871897] ? get_partial_node+0x1c/0x320
[ 166.880983] ? deactivate_slab+0x269/0x2b0
[ 166.890069] ___slab_alloc+0x521/0xa90
[ 166.898389] ? __kvmalloc_node_noprof+0x43/0xd0
[ 166.908442] __kmalloc_node_noprof+0x216/0x3f0
[ 166.918302] ? __kvmalloc_node_noprof+0x43/0xd0
[ 166.928354] __kvmalloc_node_noprof+0x43/0xd0
[ 166.938021] mlx5e_open_channels+0x5e2/0xc00
[ 166.947496] mlx5e_open_locked+0x3e/0xf0
[ 166.956201] mlx5e_open+0x23/0x50
[ 166.963551] __dev_open+0x114/0x1c0
[ 166.971292] __dev_change_flags+0xa2/0x1b0
[ 166.980378] dev_change_flags+0x21/0x60
[ 166.988887] do_setlink+0x38d/0xf20
[ 166.996628] ? ep_poll_callback+0x1b9/0x240
[ 167.005910] ? __nla_validate_parse.llvm.10713395753544950386+0x80/0xd70
[ 167.020782] ? __wake_up_sync_key+0x52/0x80
[ 167.030066] ? __mutex_lock+0xff/0x550
[ 167.038382] ? security_capable+0x50/0x90
[ 167.047279] rtnl_setlink+0x1c9/0x210
[ 167.055403] ? ep_poll_callback+0x1b9/0x240
[ 167.064684] ? security_capable+0x50/0x90
[ 167.073579] rtnetlink_rcv_msg+0x2f9/0x310
[ 167.082667] ? rtnetlink_bind+0x30/0x30
[ 167.091173] netlink_rcv_skb+0xb1/0xe0
[ 167.099492] netlink_unicast+0x20f/0x2e0
[ 167.108191] netlink_sendmsg+0x389/0x420
[ 167.116896] __sys_sendto+0x158/0x1c0
[ 167.125024] __x64_sys_sendto+0x22/0x30
[ 167.133534] do_syscall_64+0x63/0x130
[ 167.141657] ? __irq_exit_rcu.llvm.17843942359718260576+0x52/0xd0
[ 167.155181] entry_SYSCALL_64_after_hwframe+0x4b/0x53

Fixes: bb135e40129d ("net/mlx5e: move XDP_REDIRECT sq to dynamic allocation") Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20250123000407.3464715-1-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
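The gist of the fix, sketched; the surrounding allocation is illustrative rather than the exact mlx5e code:

    /* kvzalloc_node() wants a NUMA node id, not a CPU id */
    sq = kvzalloc_node(sizeof(*sq), GFP_KERNEL, cpu_to_node(cpu));
    if (!sq)
            return -ENOMEM;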
2025-01-27  net: netdevsim: try to close UDP port harness races  (Jakub Kicinski)
syzbot discovered that we remove the debugfs files after we free the netdev. Try to clean up the relevant dir while the device is still around. Reported-by: syzbot+2e5de9e3ab986b71d2bf@syzkaller.appspotmail.com Fixes: 424be63ad831 ("netdevsim: add UDP tunnel port offload support") Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/20250122224503.762705-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-27  net: rose: fix timer races against user threads  (Eric Dumazet)
Rose timers only acquire the socket spinlock, without checking if the socket is owned by one user thread. Add a check and rearm the timers if needed.

BUG: KASAN: slab-use-after-free in rose_timer_expiry+0x31d/0x360 net/rose/rose_timer.c:174
Read of size 2 at addr ffff88802f09b82a by task swapper/0/0
CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.13.0-rc5-syzkaller-00172-gd1bf27c4e176 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0x169/0x550 mm/kasan/report.c:489
kasan_report+0x143/0x180 mm/kasan/report.c:602
rose_timer_expiry+0x31d/0x360 net/rose/rose_timer.c:174
call_timer_fn+0x187/0x650 kernel/time/timer.c:1793
expire_timers kernel/time/timer.c:1844 [inline]
__run_timers kernel/time/timer.c:2418 [inline]
__run_timer_base+0x66a/0x8e0 kernel/time/timer.c:2430
run_timer_base kernel/time/timer.c:2439 [inline]
run_timer_softirq+0xb7/0x170 kernel/time/timer.c:2449
handle_softirqs+0x2d4/0x9b0 kernel/softirq.c:561
__do_softirq kernel/softirq.c:595 [inline]
invoke_softirq kernel/softirq.c:435 [inline]
__irq_exit_rcu+0xf7/0x220 kernel/softirq.c:662
irq_exit_rcu+0x9/0x30 kernel/softirq.c:678
instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1049 [inline]
sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1049
</IRQ>

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250122180244.1861468-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
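A sketch of the general pattern for protocol timers, assuming the usual socket-locking helpers; the rearm interval and surrounding details are illustrative:

    static void rose_timer_sketch(struct sock *sk)
    {
            bh_lock_sock(sk);
            if (sock_owned_by_user(sk)) {
                    /* a user thread owns the socket: back off and retry */
                    sk_reset_timer(sk, &sk->sk_timer, jiffies + HZ / 20);
                    bh_unlock_sock(sk);
                    return;
            }
            /* ... original expiry handling runs here ... */
            bh_unlock_sock(sk);
    }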
2025-01-27  net: the appletalk subsystem no longer uses ndo_do_ioctl  (谢致邦 (XIE Zhibang))
ndo_do_ioctl is no longer used by the appletalk subsystem after commit 45bd1c5ba758 ("net: appletalk: Drop aarp_send_probe_phase1()"). Signed-off-by: 谢致邦 (XIE Zhibang) <Yeking@Red54.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/tencent_4AC6ED413FEA8116B4253D3ED6947FDBCF08@qq.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-27  fuse: prevent disabling io-uring on active connections  (Bernd Schubert)
The enable_uring module parameter allows administrators to enable/disable io-uring support for FUSE at runtime. However, disabling io-uring while connections already have it enabled can lead to an inconsistent state. Fix this by keeping io-uring enabled on connections that were already using it, even if the module parameter is later disabled. This ensures active FUSE mounts continue to function correctly. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: Luis Henriques <luis@igalia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-01-27  fuse: enable fuse-over-io-uring  (Bernd Schubert)
All required parts are handled now, fuse-io-uring can be enabled. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # io_uring Reviewed-by: Luis Henriques <luis@igalia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-01-27  fuse: block request allocation until io-uring init is complete  (Bernd Schubert)
Avoid races and block request allocation until io-uring queues are ready. This is especially important for background requests, as bg request completion might cause lock order inversion of the typical queue->lock and then fc->bg_lock:

    fuse_request_end
        spin_lock(&fc->bg_lock);
        flush_bg_queue
            fuse_send_one
                fuse_uring_queue_fuse_req
                    spin_lock(&queue->lock);

Signed-off-by: Bernd Schubert <bernd@bsbernd.com> Reviewed-by: Luis Henriques <luis@igalia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-01-27  fuse: {io-uring} Prevent mount point hang on fuse-server termination  (Bernd Schubert)
When the fuse-server terminates while the fuse-client or kernel still has queued URING_CMDs, these commands retain references to the struct file used by the fuse connection. This prevents fuse_dev_release() from being invoked, resulting in a hung mount point. This patch addresses the issue by making queued URING_CMDs cancelable, allowing fuse_dev_release() to proceed as expected and preventing the mount point from hanging. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # io_uring Reviewed-by: Luis Henriques <luis@igalia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-01-27  fuse: Allow to queue bg requests through io-uring  (Bernd Schubert)
This prepares queueing and sending background requests through io-uring. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # io_uring Reviewed-by: Luis Henriques <luis@igalia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-01-27  fuse: Allow to queue fg requests through io-uring  (Bernd Schubert)
This prepares queueing and sending foreground requests through io-uring. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # io_uring Reviewed-by: Luis Henriques <luis@igalia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-01-27  fuse: {io-uring} Make fuse_dev_queue_{interrupt,forget} non-static  (Bernd Schubert)
These functions are also needed by fuse-over-io-uring. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: Luis Henriques <luis@igalia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-01-27  fuse: {io-uring} Handle teardown of ring entries  (Bernd Schubert)
On teardown, struct file_operations::uring_cmd requests need to be completed by calling io_uring_cmd_done(). Not completing all ring entries would result in busy io-uring tasks that print warning messages at intervals, and in an unreleased struct file. Additionally, the fuse connection, and with it the ring, can only get released when all io-uring commands are completed.

Completion is done with ring entries that are:
a) in waiting state for new fuse requests - io_uring_cmd_done is needed
b) already in userspace - io_uring_cmd_done through teardown is not needed, the request can just get released. If the fuse server is still active and commits such a ring entry, fuse_uring_cmd() already checks if the connection is active and then completes the io-uring itself with -ENOTCONN, i.e. special handling is not needed.

This scheme is basically represented by the ring entry states FRRS_WAIT and FRRS_USERSPACE. Entries in state:
- FRRS_INIT: No action needed, do not contribute to ring->queue_refs yet
- All other states: Are currently processed by other tasks; async teardown is needed and it has to wait for the two states above.

It could also be solved without an async teardown task, but that would require additional if conditions in hot code paths. Also, in my personal opinion the code looks cleaner with async teardown. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # io_uring Reviewed-by: Luis Henriques <luis@igalia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-01-27  Merge tag 'mips_6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux  (Linus Torvalds)
Pull MIPS updates from Thomas Bogendoerfer: "Cleanups and fixes"

* tag 'mips_6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
  MIPS: pci-legacy: Override pci_address_to_pio
  MIPS: Loongson64: env: Use str_on_off() helper in prom_lefi_init_env()
  MIPS: migrate to generic rule for built-in DTBs
  mips: fix shmctl/semctl/msgctl syscall for o32
  mips/math-emu: fix emulation of the prefx instruction
  MIPS: Loongson: Add comments for interface_info
  MIPS: Loongson64: remove ROM Size unit in boardinfo
  MIPS: traps: Use str_enabled_disabled() in parity_protection_init()
  MIPS: ftrace: Declare ftrace_get_parent_ra_addr() as static
  Revert "MIPS: csrc-r4k: Select HAVE_UNSTABLE_SCHED_CLOCK if SMP && 64BIT"
  MIPS: Fix the wrong format specifier
  MIPS: Add a blank line after __HEAD
  MIPS: kernel: Rename read/write_c0_ecc to read/writec0_errctl
2025-01-27  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linux  (Linus Torvalds)
Pull ARM updates from Russell King:
 - fix typos in vfpmodule.c
 - drop obsolete VFP accessor fallback for old assemblers
 - add cache line identifier register accessor functions
 - add cacheinfo support

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linux:
  ARM: 9440/1: cacheinfo fix format field mask
  ARM: 9433/2: implement cacheinfo support
  ARM: 9432/2: add CLIDR accessor functions
  ARM: 9438/1: assembler: Drop obsolete VFP accessor fallback
  ARM: 9437/1: vfp: Fix typographical errors in vfpmodule.c
2025-01-27  vfio/nvgrace-gpu: Add GB200 SKU to the devid table  (Ankit Agrawal)
NVIDIA is productizing the new Grace Blackwell superchip SKU bearing device ID 0x2941. Add the SKU devid to nvgrace_gpu_vfio_pci_table. CC: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20250124183102.3976-5-ankita@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-01-27  vfio/nvgrace-gpu: Check the HBM training and C2C link status  (Ankit Agrawal)
In contrast to Grace Hopper systems, the HBM training has been moved out of the UEFI on the Grace Blackwell systems. This reduces the system bootup time significantly. The onus of checking whether the HBM training has completed thus falls on the module. The HBM training status can be determined from a BAR0 register. Similarly, another BAR0 register exposes the status of the CPU-GPU chip-to-chip (C2C) cache coherent interconnect. Based on testing, 30s is determined to be sufficient to ensure initialization completion on all the Grace based systems. Thus poll these registers for up to 30s. If the HBM training is not complete or if the C2C link is not ready, fail the probe. While the wait is not required on Grace Hopper systems, it is beneficial to make the check to ensure the device is in an expected state; hence it is kept generalized across both generations. Ensure that BAR0 is enabled before accessing the registers. CC: Alex Williamson <alex.williamson@redhat.com> CC: Kevin Tian <kevin.tian@intel.com> CC: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20250124183102.3976-4-ankita@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
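A hedged sketch of such a probe-time wait; the register offsets, expected values, and helper name are illustrative placeholders, not the actual nvgrace-gpu definitions:

    #define READY_POLL_TIMEOUT_MS   (30 * 1000)     /* 30s, per the description above */
    #define READY_POLL_INTERVAL_MS  100

    static int wait_device_ready(void __iomem *bar0)
    {
            unsigned long end = jiffies + msecs_to_jiffies(READY_POLL_TIMEOUT_MS);

            do {
                    u32 hbm = readl(bar0 + HBM_TRAINING_STATUS);    /* placeholder offsets */
                    u32 c2c = readl(bar0 + C2C_LINK_STATUS);

                    if (hbm == STATUS_DONE && c2c == STATUS_DONE)
                            return 0;
                    msleep(READY_POLL_INTERVAL_MS);
            } while (time_before(jiffies, end));

            return -ETIME;  /* caller fails the probe */
    }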
2025-01-27  vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM  (Ankit Agrawal)
There is a HW defect on Grace Hopper (GH) to support the Multi-Instance GPU (MIG) feature [1] that necessitated the presence of a 1G region carved out from the device memory and mapped as uncached. The 1G region is shown as a fake BAR (comprising region 2 and 3) to workaround the issue. The Grace Blackwell systems (GB) differ from GH systems in the following aspects:
1. The aforementioned HW defect is fixed on GB systems.
2. There is a usable BAR1 (region 2 and 3) on GB systems for the GPUdirect RDMA feature [2].
This patch accommodates those GB changes by showing the 64b physical device BAR1 (region 2 and 3) to the VM instead of the fake one. This takes care of both differences. Moreover, the entire device memory is exposed on GB as cacheable to the VM as there is no carveout required. Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1] Link: https://docs.nvidia.com/cuda/gpudirect-rdma/ [2] Cc: Kevin Tian <kevin.tian@intel.com> CC: Jason Gunthorpe <jgg@nvidia.com> Suggested-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20250124183102.3976-3-ankita@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-01-27  vfio/nvgrace-gpu: Read dvsec register to determine need for uncached resmem  (Ankit Agrawal)
NVIDIA's recently introduced Grace Blackwell (GB) Superchip is a continuation of the Grace Hopper (GH) superchip and provides the CPU and GPU with cache coherent access to each other's memory over an internal proprietary chip-to-chip cache coherent interconnect. There is a HW defect on GH systems to support the Multi-Instance GPU (MIG) feature [1] that necessitated the presence of a 1G region with uncached mapping carved out from the device memory. The 1G region is shown as a fake BAR (comprising region 2 and 3) to workaround the issue. This is fixed on the GB systems. The presence of the fix for the HW defect is communicated by the device firmware through the DVSEC PCI config register with ID 3. The module reads this to take a different codepath on GB vs GH. Scan through the DVSEC registers to identify the correct one and use it to determine the presence of the fix. Save the value in the device's nvgrace_gpu_pci_core_device structure. Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1] CC: Jason Gunthorpe <jgg@nvidia.com> CC: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20250124183102.3976-2-ankita@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-01-27  fuse: Add io-uring sqe commit and fetch support  (Bernd Schubert)
This adds support for fuse request completion through ring SQEs (FUSE_URING_CMD_COMMIT_AND_FETCH handling). After committing the ring entry it becomes available for new fuse requests. Handling of requests through the ring (SQE/CQE handling) is complete now. Fuse request data are copied through the mmapped ring buffer; there is no support for any zero copy yet. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # io_uring Reviewed-by: Luis Henriques <luis@igalia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-01-27  Merge tag 'm68knommu-for-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu  (Linus Torvalds)
Pull m68knommu update from Greg Ungerer: "Just a single fix to correct the clock rate defined for the internal timer hardware blocks of the ColdFire 5441x family of SoC devices"

* tag 'm68knommu-for-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu:
  m68k: coldfire: Use proper clock rate for timers
2025-01-27  Merge tag 'xtensa-20250126' of https://github.com/jcmvbkbc/linux-xtensa  (Linus Torvalds)
Pull xtensa updates from Max Filippov:
 - a few one-liner cleanups

* tag 'xtensa-20250126' of https://github.com/jcmvbkbc/linux-xtensa:
  xtensa/simdisk: Use str_write_read() helper in simdisk_transfer()
  xtensa: Remove zero-length alignment array
  xtensa: annotate dtb_start variable as static __initdata
2025-01-27  virtio_blk: Add support for transport error recovery  (Israel Rukshin)
Add support for proper cleanup and re-initialization of virtio-blk devices during transport reset error recovery flow. This enhancement includes:
- Pre-reset handler (reset_prepare) to perform device-specific cleanup
- Post-reset handler (reset_done) to re-initialize the device
These changes allow the device to recover from various reset scenarios, ensuring proper functionality after a reset event occurs. Without this implementation, the device cannot properly recover from resets, potentially leading to undefined behavior or device malfunction. This feature has been tested using PCI transport with Function Level Reset (FLR) as an example reset mechanism. The reset can be triggered manually via sysfs (echo 1 > /sys/bus/pci/devices/$PCI_ADDR/reset). Signed-off-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Message-Id: <1732690652-3065-3-git-send-email-israelr@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27  virtio_pci: Add support for PCIe Function Level Reset  (Israel Rukshin)
Implement support for Function Level Reset (FLR) in virtio_pci devices. This change adds reset_prepare and reset_done callbacks, allowing drivers to properly handle FLR operations. Without this patch, performing and recovering from an FLR is not possible for virtio_pci devices. This implementation ensures proper FLR handling and recovery for both physical and virtual functions. The device reset can be triggered in case of error or manually via sysfs: echo 1 > /sys/bus/pci/devices/$PCI_ADDR/reset Signed-off-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Message-Id: <1732690652-3065-2-git-send-email-israelr@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
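The shape of the hook-up, sketched against the generic PCI error-handler API; the function bodies and names are placeholders, not the actual virtio_pci implementation:

    static void virtio_pci_reset_prepare_sketch(struct pci_dev *pci_dev)
    {
            /* quiesce and tear down device state before the FLR */
    }

    static void virtio_pci_reset_done_sketch(struct pci_dev *pci_dev)
    {
            /* re-initialize the device once the reset has completed */
    }

    static const struct pci_error_handlers virtio_pci_err_handler_sketch = {
            .reset_prepare  = virtio_pci_reset_prepare_sketch,
            .reset_done     = virtio_pci_reset_done_sketch,
    };
    /* assigned to the driver's struct pci_driver .err_handler field */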
2025-01-27  vhost/net: Set num_buffers for virtio 1.0  (Akihiko Odaki)
The specification says the device MUST set num_buffers to 1 if VIRTIO_NET_F_MRG_RXBUF has not been negotiated. Fixes: 41e3e42108bc ("vhost/net: enable virtio 1.0") Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Message-Id: <20240915-v1-v1-1-f10d2cb5e759@daynix.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
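The requirement boils down to something like the following, sketched with vhost's endianness helper; the variable names are illustrative:

    /* virtio 1.0: num_buffers must be 1 when VIRTIO_NET_F_MRG_RXBUF was
     * not negotiated (previously the field was only set when mergeable
     * receive buffers were in use) */
    num_buffers = cpu_to_vhost16(vq, mergeable ? headcount : 1);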
2025-01-27  vdpa/octeon_ep: read vendor-specific PCI capability  (Shijith Thotton)
Added support to read the vendor-specific PCI capability to identify the type of device being emulated. Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Shijith Thotton <sthotton@marvell.com> Message-Id: <20250103153226.1933479-4-sthotton@marvell.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27  virtio-pci: define type and header for PCI vendor data  (Shijith Thotton)
Added macro definition for VIRTIO_PCI_CAP_VENDOR_CFG to identify the PCI vendor data type in the virtio_pci_cap structure. Defined a new struct virtio_pci_vndr_data for the vendor data capability header as per the specification. Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Shijith Thotton <sthotton@marvell.com> Message-Id: <20250103153226.1933479-3-sthotton@marvell.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27  vdpa/octeon_ep: handle device config change events  (Satha Rao)
The first interrupt of the device is used to notify the host about device configuration changes, such as link status updates. The ISR configuration area is updated to indicate a config change event when triggered. Signed-off-by: Satha Rao <skoteshwar@marvell.com> Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Shijith Thotton <sthotton@marvell.com> Message-Id: <20250103153226.1933479-2-sthotton@marvell.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27  vdpa/octeon_ep: enable support for multiple interrupts per device  (Shijith Thotton)
Updated the driver to utilize all the MSI-X interrupt vectors supported by each OCTEON endpoint VF, instead of relying on a single vector. Enabling more interrupts allows packets from multiple rings to be distributed across multiple cores, improving parallelism and performance. Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Shijith Thotton <sthotton@marvell.com> Message-Id: <20250103153226.1933479-1-sthotton@marvell.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
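A generic sketch of moving from one shared vector to per-ring MSI-X vectors using the standard PCI IRQ API; the names (nr_rings, ring_isr, rings) are illustrative, not the octeon_ep code:

    int i, err, nvecs;

    nvecs = pci_alloc_irq_vectors(pdev, 1, nr_rings, PCI_IRQ_MSIX);
    if (nvecs < 0)
            return nvecs;

    for (i = 0; i < nvecs; i++) {
            /* one interrupt per ring so completions spread across cores */
            err = request_irq(pci_irq_vector(pdev, i), ring_isr, 0,
                              "octep_vdpa_ring", &rings[i]);
            if (err)
                    goto err_free_irqs;     /* unwind previously requested IRQs */
    }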
2025-01-27  vdpa: solidrun: Replace deprecated PCI functions  (Philipp Stanner)
The PCI functions pcim_iomap_regions(), pcim_iounmap_regions(), and pcim_iomap_table() have been deprecated by the PCI subsystem. Replace these functions with their successors pcim_iomap_region() and pcim_iounmap_region(). Signed-off-by: Philipp Stanner <pstanner@redhat.com> Message-Id: <20241219094428.21511-2-phasta@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Stefano Garzarella <sgarzare@redhat.com>
2025-01-27  s390/kdump: virtio-mem kdump support (CONFIG_PROC_VMCORE_DEVICE_RAM)  (David Hildenbrand)
Let's add support for including virtio-mem device RAM in the crash dump, setting NEED_PROC_VMCORE_DEVICE_RAM, and implementing elfcorehdr_fill_device_ram_ptload_elf64(). To avoid code duplication, factor out the code to fill a PT_LOAD entry. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-13-david@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27  virtio-mem: support CONFIG_PROC_VMCORE_DEVICE_RAM  (David Hildenbrand)
Let's implement the get_device_ram() vmcore callback, so architectures that select NEED_PROC_VMCORE_DEVICE_RAM, like s390 soon, can include that memory in a crash dump. Merge ranges, and process ranges that might contain a mixture of plugged and unplugged memory, to reduce the total number of ranges. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-12-david@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27  virtio-mem: remember usable region size  (David Hildenbrand)
Let's remember the usable region size, which will be helpful in kdump mode next. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-11-david@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27  virtio-mem: mark device ready before registering callbacks in kdump mode  (David Hildenbrand)
After the callbacks are registered we may immediately get a callback. So mark the device ready before registering the callbacks. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-10-david@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27  fs/proc/vmcore: introduce PROC_VMCORE_DEVICE_RAM to detect device RAM ranges in 2nd kernel  (David Hildenbrand)
s390 allocates+prepares the elfcore hdr in the dump (2nd) kernel, not in the crashed kernel. RAM provided by memory devices such as virtio-mem can only be detected using the device driver; when vmcore_init() is called, these device drivers are usually not loaded yet, or the devices did not get probed yet. Consequently, on s390 these RAM ranges will not be included in the crash dump, which makes the dump partially corrupt and is unfortunate. Instead of deferring the vmcore_init() call to an (unclear?) later point, let's reuse the vmcore_cb infrastructure to obtain device RAM ranges as the device drivers probe the device and get access to this information. Then, we'll add these ranges to the vmcore, adding more PT_LOAD entries and updating the offsets+vmcore size. Use a separate Kconfig option to be set by an architecture to include this code only if the arch really needs it. Further, we'll make the config depend on the relevant drivers (i.e., virtio_mem) once they implement support (next). The alternative of having a PROVIDE_PROC_VMCORE_DEVICE_RAM config option was dropped for now for simplicity. The current target use case is s390, which only creates an elf64 elfcore, so focusing on elf64 is sufficient. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-9-david@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27  fs/proc/vmcore: factor out freeing a list of vmcore ranges  (David Hildenbrand)
Let's factor it out into include/linux/crash_dump.h, from where we can use it also outside of vmcore.c later. Acked-by: Baoquan He <bhe@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-8-david@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27  fs/proc/vmcore: factor out allocating a vmcore range and adding it to a list  (David Hildenbrand)
Let's factor it out into include/linux/crash_dump.h, from where we can use it also outside of vmcore.c later. Acked-by: Baoquan He <bhe@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-7-david@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27  fs/proc/vmcore: move vmcore definitions out of kcore.h  (David Hildenbrand)
These vmcore defines are not related to /proc/kcore, move them out. We'll move "struct vmcoredd_node" to vmcore.c, because it is only used internally. While "struct vmcore" is only used internally for now, we're planning on using it from inline functions in crash_dump.h next, so move it to crash_dump.h. While at it, rename "struct vmcore" to "struct vmcore_range", which is a more suitable name and will make the usage of it outside of vmcore.c clearer. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-6-david@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27  fs/proc/vmcore: prefix all pr_* with "vmcore:"  (David Hildenbrand)
Let's use "vmcore: " as a prefix, converting the single "Kdump: vmcore not initialized" one to effectively be "vmcore: not initialized". Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-5-david@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27  fs/proc/vmcore: disallow vmcore modifications while the vmcore is open  (David Hildenbrand)
The vmcoredd_update_size() call and its effects (size/offset changes) are currently completely unsynchronized, and will cause trouble when performed concurrently, or when done while someone is already reading the vmcore. Let's protect all vmcore modifications by the vmcore_mutex, disallow vmcore modifications while the vmcore is open, and warn on vmcore modifications after the vmcore was already opened once: modifications while the vmcore is open are unsafe, and modifications after the vmcore was opened indicate trouble. Properly synchronize against concurrent opening of the vmcore. No need to grab the mutex during mmap()/read(): after we opened the vmcore, modifications are impossible. It's worth noting that modifications after the vmcore was opened are completely unexpected, so failing if open, and warning if already opened (+closed again) is good enough. This change not only handles concurrent adding of device dumps + concurrent reading of the vmcore properly, it also prepares for other mechanisms that will modify the vmcore. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-4-david@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
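Conceptually the guard looks roughly like this; vmcore_mutex comes from the earlier patches in this series, while the open-tracking flag and label are illustrative:

    mutex_lock(&vmcore_mutex);
    if (vmcore_opened) {            /* illustrative flag */
            pr_warn_once("vmcore: modification attempted after open\n");
            ret = -EBUSY;
            goto out_unlock;
    }
    /* ... add the device dump, update PT_LOAD offsets and total size ... */
out_unlock:
    mutex_unlock(&vmcore_mutex);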
2025-01-27  fs/proc/vmcore: replace vmcoredd_mutex by vmcore_mutex  (David Hildenbrand)
Now that we have a mutex that synchronizes against opening of the vmcore, let's use that one to replace vmcoredd_mutex: there is no need to have two separate ones. This is a preparation for properly preventing vmcore modifications after the vmcore was opened. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-3-david@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27  fs/proc/vmcore: convert vmcore_cb_lock into vmcore_mutex  (David Hildenbrand)
We want to protect vmcore modifications from concurrent opening of the vmcore, and also serialize vmcore modification. (a) We can currently modify the vmcore after it was opened. This can happen if a vmcoredd is added after the vmcore module was initialized and already opened by user space. We want to fix that and prepare for new code wanting to serialize against concurrent opening. (b) To handle it cleanly we need to protect the modifications against concurrent opening. As the modifications end up allocating memory and can sleep, we cannot rely on the spinlock. Let's convert the spinlock into a mutex to prepare for further changes. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-2-david@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27  net/ncsi: use dev_set_mac_address() for Get MC MAC Address handling  (Paul Fertser)
Copy of the rationale from 790071347a0a1a89e618eedcd51c687ea783aeb3: Change ndo_set_mac_address to dev_set_mac_address because dev_set_mac_address provides a way to notify the network layer about a MAC change. Otherwise, services may not be aware of the MAC change and keep using the old one, which was set by the network adapter driver. As an example, the DHCP client from systemd does not update the MAC address without notification from the net subsystem, which leads to problems acquiring the right address from the DHCP server. Since dev_set_mac_address requires the RTNL lock, the operation can not be performed directly in the response handler, see 9e2bbab94b88295dcc57c7580393c9ee08d7314d. The way of selecting the first suitable MAC address from the list is changed: instead of having the driver check it, this patch just assumes any valid MAC should be good. Fixes: b8291cf3d118 ("net/ncsi: Add NC-SI 1.2 Get MC MAC Address command") Signed-off-by: Paul Fertser <fercerpav@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
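The underlying call, sketched from a deferred (non-handler) context where RTNL may be taken; the surrounding variables are illustrative:

    struct sockaddr saddr = { .sa_family = ndev->type };

    memcpy(saddr.sa_data, mac, ETH_ALEN);
    rtnl_lock();
    /* unlike calling ndo_set_mac_address() directly, this notifies the
     * rest of the stack (and thus userspace) about the address change */
    ret = dev_set_mac_address(ndev, &saddr, NULL);
    rtnl_unlock();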
2025-01-26  bcachefs: Improve trace_move_extent_finish  (Kent Overstreet)
We're currently debugging issues with rebalance, where it's not making progress as quickly as it should be (or sometimes not at all). Add the full data_update to the move_extent_finish tracepoint, so we can check that the replicas we wrote match what we were supposed to do. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-26  bcachefs: Fix trace_copygc  (Kent Overstreet)
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-26  bcachefs: Journal writes are now IOPRIO_CLASS_RT  (Kent Overstreet)
System performance is particularly sensitive to journal write latency, the number of outstanding journal writes is bounded and we can't issue journal flushes until other journal writes have completed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
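In practice this amounts to tagging the journal write bios with a real-time I/O priority, roughly as follows (a sketch, not the exact bcachefs code):

    /* mark journal write bios as real-time priority so the I/O scheduler
     * services them ahead of bulk data writes */
    bio->bi_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 0);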