summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-10-09selftests: netfilter: conntrack_vrf.sh: add fib test caseFlorian Westphal
meta iifname veth0 ip daddr ... fib daddr oif ... is expected to return "dummy0" interface which is part of same vrf as veth0. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-10-09netfilter: fib: check correct rtable in vrf setupsFlorian Westphal
We need to init l3mdev unconditionally, else main routing table is searched and incorrect result is returned unless strict (iif keyword) matching is requested. Next patch adds a selftest for this. Fixes: 2a8a7c0eaa87 ("netfilter: nft_fib: Fix for rpath check with VRF devices") Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1761 Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-10-09netfilter: xtables: avoid NFPROTO_UNSPEC where neededFlorian Westphal
syzbot managed to call xt_cluster match via ebtables: WARNING: CPU: 0 PID: 11 at net/netfilter/xt_cluster.c:72 xt_cluster_mt+0x196/0x780 [..] ebt_do_table+0x174b/0x2a40 Module registers to NFPROTO_UNSPEC, but it assumes ipv4/ipv6 packet processing. As this is only useful to restrict locally terminating TCP/UDP traffic, register this for ipv4 and ipv6 family only. Pablo points out that this is a general issue, direct users of the set/getsockopt interface can call into targets/matches that were only intended for use with ip(6)tables. Check all UNSPEC matches and targets for similar issues: - matches and targets are fine except if they assume skb_network_header() is valid -- this is only true when called from inet layer: ip(6) stack pulls the ip/ipv6 header into linear data area. - targets that return XT_CONTINUE or other xtables verdicts must be restricted too, they are incompatbile with the ebtables traverser, e.g. EBT_CONTINUE is a completely different value than XT_CONTINUE. Most matches/targets are changed to register for NFPROTO_IPV4/IPV6, as they are provided for use by ip(6)tables. The MARK target is also used by arptables, so register for NFPROTO_ARP too. While at it, bail out if connbytes fails to enable the corresponding conntrack family. This change passes the selftests in iptables.git. Reported-by: syzbot+256c348558aa5cf611a9@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netfilter-devel/66fec2e2.050a0220.9ec68.0047.GAE@google.com/ Fixes: 0269ea493734 ("netfilter: xtables: add cluster match") Signed-off-by: Florian Westphal <fw@strlen.de> Co-developed-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-10-09mm: zswap: delete comments for "value" member of 'struct zswap_entry'.Kanchana P Sridhar
Made a minor edit in the comments for 'struct zswap_entry' to delete the description of the 'value' member that was deleted in commit 20a5532ffa53 ("mm: remove code to handle same filled pages"). Link: https://lkml.kernel.org/r/20241002194213.30041-1-kanchana.p.sridhar@intel.com Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Fixes: 20a5532ffa53 ("mm: remove code to handle same filled pages") Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Usama Arif <usamaarif642@gmail.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Huang Ying <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Wajdi Feghali <wajdi.k.feghali@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09CREDITS: sort alphabetically by nameKrzysztof Kozlowski
Re-sort few misplaced entries in the CREDITS file. Link: https://lkml.kernel.org/r/20241002111932.46012-1-krzysztof.kozlowski@linaro.org Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09secretmem: disable memfd_secret() if arch cannot set direct mapPatrick Roy
Return -ENOSYS from memfd_secret() syscall if !can_set_direct_map(). This is the case for example on some arm64 configurations, where marking 4k PTEs in the direct map not present can only be done if the direct map is set up at 4k granularity in the first place (as ARM's break-before-make semantics do not easily allow breaking apart large/gigantic pages). More precisely, on arm64 systems with !can_set_direct_map(), set_direct_map_invalid_noflush() is a no-op, however it returns success (0) instead of an error. This means that memfd_secret will seemingly "work" (e.g. syscall succeeds, you can mmap the fd and fault in pages), but it does not actually achieve its goal of removing its memory from the direct map. Note that with this patch, memfd_secret() will start erroring on systems where can_set_direct_map() returns false (arm64 with CONFIG_RODATA_FULL_DEFAULT_ENABLED=n, CONFIG_DEBUG_PAGEALLOC=n and CONFIG_KFENCE=n), but that still seems better than the current silent failure. Since CONFIG_RODATA_FULL_DEFAULT_ENABLED defaults to 'y', most arm64 systems actually have a working memfd_secret() and aren't be affected. From going through the iterations of the original memfd_secret patch series, it seems that disabling the syscall in these scenarios was the intended behavior [1] (preferred over having set_direct_map_invalid_noflush return an error as that would result in SIGBUSes at page-fault time), however the check for it got dropped between v16 [2] and v17 [3], when secretmem moved away from CMA allocations. [1]: https://lore.kernel.org/lkml/20201124164930.GK8537@kernel.org/ [2]: https://lore.kernel.org/lkml/20210121122723.3446-11-rppt@kernel.org/#t [3]: https://lore.kernel.org/lkml/20201125092208.12544-10-rppt@kernel.org/ Link: https://lkml.kernel.org/r/20241001080056.784735-1-roypat@amazon.co.uk Fixes: 1507f51255c9 ("mm: introduce memfd_secret system call to create "secret" memory areas") Signed-off-by: Patrick Roy <roypat@amazon.co.uk> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Gowans <jgowans@amazon.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09.mailmap: update Fangrui's emailFangrui Song
I'm leaving Google. Link: https://lkml.kernel.org/r/20240927192912.31532-1-i@maskray.me Signed-off-by: Fangrui Song <i@maskray.me> Acked-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09mm/huge_memory: check pmd_special() only after pmd_present()David Hildenbrand
We should only check for pmd_special() after we made sure that we have a present PMD. For example, if we have a migration PMD, pmd_special() might indicate that we have a special PMD although we really don't. This fixes confusing migration entries as PFN mappings, and not doing what we are supposed to do in the "is_swap_pmd()" case further down in the function -- including messing up COW, page table handling and accounting. Link: https://lkml.kernel.org/r/20240926154234.2247217-1-david@redhat.com Fixes: bc02afbd4d73 ("mm/fork: accept huge pfnmap entries") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: syzbot+bf2c35fa302ebe3c7471@syzkaller.appspotmail.com Closes: https://lore.kernel.org/lkml/66f15c8d.050a0220.c23dd.000f.GAE@google.com/ Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09resource, kunit: fix user-after-free in resource_test_region_intersects()Huang Ying
In resource_test_insert_resource(), the pointer is used in error message after kfree(). This is user-after-free. To fix this, we need to call kunit_add_action_or_reset() to schedule memory freeing after usage. But kunit_add_action_or_reset() itself may fail and free the memory. So, its return value should be checked and abort the test for failure. Then, we found that other usage of kunit_add_action_or_reset() in resource_test_region_intersects() needs to be fixed too. We fix all these user-after-free bugs in this patch. Link: https://lkml.kernel.org/r/20240930070611.353338-1-ying.huang@intel.com Fixes: 99185c10d5d9 ("resource, kunit: add test case for region_intersects()") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reported-by: Kees Bakker <kees@ijzerbout.nl> Closes: https://lore.kernel.org/lkml/87ldzaotcg.fsf@yhuang6-desk2.ccr.corp.intel.com/ Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09fs/proc/kcore.c: allow translation of physical memory addressesAlexander Gordeev
When /proc/kcore is read an attempt to read the first two pages results in HW-specific page swap on s390 and another (so called prefix) pages are accessed instead. That leads to a wrong read. Allow architecture-specific translation of memory addresses using kc_xlate_dev_mem_ptr() and kc_unxlate_dev_mem_ptr() callbacks similarily to /dev/mem xlate_dev_mem_ptr() and unxlate_dev_mem_ptr() callbacks. That way an architecture can deal with specific physical memory ranges. Re-use the existing /dev/mem callback implementation on s390, which handles the described prefix pages swapping correctly. For other architectures the default callback is basically NOP. It is expected the condition (vaddr == __va(__pa(vaddr))) always holds true for KCORE_RAM memory type. Link: https://lkml.kernel.org/r/20240930122119.1651546-1-agordeev@linux.ibm.com Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Suggested-by: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09selftests/mm: fix incorrect buffer->mirror size in hmm2 double_map testDonet Tom
The hmm2 double_map test was failing due to an incorrect buffer->mirror size. The buffer->mirror size was 6, while buffer->ptr size was 6 * PAGE_SIZE. The test failed because the kernel's copy_to_user function was attempting to copy a 6 * PAGE_SIZE buffer to buffer->mirror. Since the size of buffer->mirror was incorrect, copy_to_user failed. This patch corrects the buffer->mirror size to 6 * PAGE_SIZE. Test Result without this patch ============================== # RUN hmm2.hmm2_device_private.double_map ... # hmm-tests.c:1680:double_map:Expected ret (-14) == 0 (0) # double_map: Test terminated by assertion # FAIL hmm2.hmm2_device_private.double_map not ok 53 hmm2.hmm2_device_private.double_map Test Result with this patch =========================== # RUN hmm2.hmm2_device_private.double_map ... # OK hmm2.hmm2_device_private.double_map ok 53 hmm2.hmm2_device_private.double_map Link: https://lkml.kernel.org/r/20240927050752.51066-1-donettom@linux.ibm.com Fixes: fee9f6d1b8df ("mm/hmm/test: add selftests for HMM") Signed-off-by: Donet Tom <donettom@linux.ibm.com> Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Brown <broonie@kernel.org> Cc: Przemek Kitszel <przemyslaw.kitszel@intel.com> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09device-dax: correct pgoff align in dax_set_mapping()Kun(llfl)
pgoff should be aligned using ALIGN_DOWN() instead of ALIGN(). Otherwise, vmf->address not aligned to fault_size will be aligned to the next alignment, that can result in memory failure getting the wrong address. It's a subtle situation that only can be observed in page_mapped_in_vma() after the page is page fault handled by dev_dax_huge_fault. Generally, there is little chance to perform page_mapped_in_vma in dev-dax's page unless in specific error injection to the dax device to trigger an MCE - memory-failure. In that case, page_mapped_in_vma() will be triggered to determine which task is accessing the failure address and kill that task in the end. We used self-developed dax device (which is 2M aligned mapping) , to perform error injection to random address. It turned out that error injected to non-2M-aligned address was causing endless MCE until panic. Because page_mapped_in_vma() kept resulting wrong address and the task accessing the failure address was never killed properly: [ 3783.719419] Memory failure: 0x200c9742: recovery action for dax page: Recovered [ 3784.049006] mce: Uncorrected hardware memory error in user-access at 200c9742380 [ 3784.049190] Memory failure: 0x200c9742: recovery action for dax page: Recovered [ 3784.448042] mce: Uncorrected hardware memory error in user-access at 200c9742380 [ 3784.448186] Memory failure: 0x200c9742: recovery action for dax page: Recovered [ 3784.792026] mce: Uncorrected hardware memory error in user-access at 200c9742380 [ 3784.792179] Memory failure: 0x200c9742: recovery action for dax page: Recovered [ 3785.162502] mce: Uncorrected hardware memory error in user-access at 200c9742380 [ 3785.162633] Memory failure: 0x200c9742: recovery action for dax page: Recovered [ 3785.461116] mce: Uncorrected hardware memory error in user-access at 200c9742380 [ 3785.461247] Memory failure: 0x200c9742: recovery action for dax page: Recovered [ 3785.764730] mce: Uncorrected hardware memory error in user-access at 200c9742380 [ 3785.764859] Memory failure: 0x200c9742: recovery action for dax page: Recovered [ 3786.042128] mce: Uncorrected hardware memory error in user-access at 200c9742380 [ 3786.042259] Memory failure: 0x200c9742: recovery action for dax page: Recovered [ 3786.464293] mce: Uncorrected hardware memory error in user-access at 200c9742380 [ 3786.464423] Memory failure: 0x200c9742: recovery action for dax page: Recovered [ 3786.818090] mce: Uncorrected hardware memory error in user-access at 200c9742380 [ 3786.818217] Memory failure: 0x200c9742: recovery action for dax page: Recovered [ 3787.085297] mce: Uncorrected hardware memory error in user-access at 200c9742380 [ 3787.085424] Memory failure: 0x200c9742: recovery action for dax page: Recovered It took us several weeks to pinpoint this problem,  but we eventually used bpftrace to trace the page fault and mce address and successfully identified the issue. Joao added: ; Likely we never reproduce in production because we always pin : device-dax regions in the region align they provide (Qemu does : similarly with prealloc in hugetlb/file backed memory). I think this : bug requires that we touch *unpinned* device-dax regions unaligned to : the device-dax selected alignment (page size i.e. 4K/2M/1G) Link: https://lkml.kernel.org/r/23c02a03e8d666fef11bbe13e85c69c8b4ca0624.1727421694.git.llfl@linux.alibaba.com Fixes: b9b5777f09be ("device-dax: use ALIGN() for determining pgoff") Signed-off-by: Kun(llfl) <llfl@linux.alibaba.com> Tested-by: JianXiong Zhao <zhaojianxiong.zjx@alibaba-inc.com> Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09kthread: unpark only parked kthreadFrederic Weisbecker
Calling into kthread unparking unconditionally is mostly harmless when the kthread is already unparked. The wake up is then simply ignored because the target is not in TASK_PARKED state. However if the kthread is per CPU, the wake up is preceded by a call to kthread_bind() which expects the task to be inactive and in TASK_PARKED state, which obviously isn't the case if it is unparked. As a result, calling kthread_stop() on an unparked per-cpu kthread triggers such a warning: WARNING: CPU: 0 PID: 11 at kernel/kthread.c:525 __kthread_bind_mask kernel/kthread.c:525 <TASK> kthread_stop+0x17a/0x630 kernel/kthread.c:707 destroy_workqueue+0x136/0xc40 kernel/workqueue.c:5810 wg_destruct+0x1e2/0x2e0 drivers/net/wireguard/device.c:257 netdev_run_todo+0xe1a/0x1000 net/core/dev.c:10693 default_device_exit_batch+0xa14/0xa90 net/core/dev.c:11769 ops_exit_list net/core/net_namespace.c:178 [inline] cleanup_net+0x89d/0xcc0 net/core/net_namespace.c:640 process_one_work kernel/workqueue.c:3231 [inline] process_scheduled_works+0xa2c/0x1830 kernel/workqueue.c:3312 worker_thread+0x86d/0xd70 kernel/workqueue.c:3393 kthread+0x2f0/0x390 kernel/kthread.c:389 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 </TASK> Fix this with skipping unecessary unparking while stopping a kthread. Link: https://lkml.kernel.org/r/20240913214634.12557-1-frederic@kernel.org Fixes: 5c25b5ff89f0 ("workqueue: Tag bound workers with KTHREAD_IS_PER_CPU") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reported-by: syzbot+943d34fa3cf2191e3068@syzkaller.appspotmail.com Tested-by: syzbot+943d34fa3cf2191e3068@syzkaller.appspotmail.com Suggested-by: Thomas Gleixner <tglx@linutronix.de> Cc: Hillf Danton <hdanton@sina.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09Revert "mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"Michal Hocko
This reverts commit eab0af905bfc3e9c05da2ca163d76a1513159aa4. There is no existing user of those flags. PF_MEMALLOC_NOWARN is dangerous because a nested allocation context can use GFP_NOFAIL which could cause unexpected failure. Such a code would be hard to maintain because it could be deeper in the call chain. PF_MEMALLOC_NORECLAIM has been added even when it was pointed out [1] that such a allocation contex is inherently unsafe if the context doesn't fully control all allocations called from this context. While PF_MEMALLOC_NOWARN is not dangerous the way PF_MEMALLOC_NORECLAIM is it doesn't have any user and as Matthew has pointed out we are running out of those flags so better reclaim it without any real users. [1] https://lore.kernel.org/all/ZcM0xtlKbAOFjv5n@tiehlicka/ Link: https://lkml.kernel.org/r/20240926172940.167084-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: James Morris <jmorris@namei.org> Cc: Jan Kara <jack@suse.cz> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Paul Moore <paul@paul-moore.com> Cc: Serge E. Hallyn <serge@hallyn.com> Cc: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09bcachefs: do not use PF_MEMALLOC_NORECLAIMMichal Hocko
Patch series "remove PF_MEMALLOC_NORECLAIM" v3. This patch (of 2): bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new inode to achieve GFP_NOWAIT semantic while holding locks. If this allocation fails it will drop locks and use GFP_NOFS allocation context. We would like to drop PF_MEMALLOC_NORECLAIM because it is really dangerous to use if the caller doesn't control the full call chain with this flag set. E.g. if any of the function down the chain needed GFP_NOFAIL request the PF_MEMALLOC_NORECLAIM would override this and cause unexpected failure. While this is not the case in this particular case using the scoped gfp semantic is not really needed bacause we can easily pus the allocation context down the chain without too much clutter. [akpm@linux-foundation.org: fix kerneldoc warnings] Link: https://lkml.kernel.org/r/20240926172940.167084-1-mhocko@kernel.org Link: https://lkml.kernel.org/r/20240926172940.167084-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> # For vfs changes Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: James Morris <jmorris@namei.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Paul Moore <paul@paul-moore.com> Cc: Serge E. Hallyn <serge@hallyn.com> Cc: Yafang Shao <laoar.shao@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09misc: sgi-gru: Don't disable preemption in GRU driverDimitri Sivanich
Disabling preemption in the GRU driver is unnecessary, and clashes with sleeping locks in several code paths. Remove preempt_disable and preempt_enable from the GRU driver. Signed-off-by: Dimitri Sivanich <sivanich@hpe.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-10-09Merge tag 'unicode-fixes-6.12-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode Pull unicode fix from Gabriel Krisman Bertazi: - Handle code-points with the Ignorable property as regular character instead of treating them as an empty string (me) * tag 'unicode-fixes-6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode: unicode: Don't special case ignorable code points
2024-10-09unicode: Don't special case ignorable code pointsGabriel Krisman Bertazi
We don't need to handle them separately. Instead, just let them decompose/casefold to themselves. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
2024-10-09ring-buffer: Do not have boot mapped buffers hook to CPU hotplugSteven Rostedt
The boot mapped ring buffer has its buffer mapped at a fixed location found at boot up. It is not dynamic. It cannot grow or be expanded when new CPUs come online. Do not hook fixed memory mapped ring buffers to the CPU hotplug callback, otherwise it can cause a crash when it tries to add the buffer to the memory that is already fully occupied. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20241008143242.25e20801@gandalf.local.home Fixes: be68d63a139bd ("ring-buffer: Add ring_buffer_alloc_range()") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-09net: mana: Enable debugfs files for MANA deviceShradha Gupta
Implement debugfs in MANA driver to be able to view RX,TX,EQ queue specific attributes and dump their gdma queues. These dumps can be used by other userspace utilities to improve debuggability and troubleshooting Following files are added in debugfs: /sys/kernel/debug/mana/ |-------------- 1 |--------------- EQs | |------- eq0 | | |---head | | |---tail | | |---eq_dump | |------- eq1 | . | . | |--------------- adapter-MTU |--------------- vport0 |------- RX-0 | |---cq_budget | |---cq_dump | |---cq_head | |---cq_tail | |---rq_head | |---rq_nbuf | |---rq_tail | |---rxq_dump |------- RX-1 . . |------- TX-0 | |---cq_budget | |---cq_dump | |---cq_head | |---cq_tail | |---sq_head | |---sq_pend_skb_qlen | |---sq_tail | |---txq_dump |------- TX-1 . . Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-10-09net: hns3/hns: Update the maintainer for the HNS3/HNS ethernet driverJijie Shao
Yisen Zhuang has left the company in September. Jian Shen will be responsible for maintaining the hns3/hns driver's code in the future, so add Jian Shen to the hns3/hns driver's matainer list. Signed-off-by: Jijie Shao <shaojijie@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-10-09r8169: add support for the temperature sensor being available from RTL8125BHeiner Kallweit
This adds support for the temperature sensor being available from RTL8125B. Register information was taken from r8125 vendor driver. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-10-09sctp: ensure sk_state is set to CLOSED if hashing fails in sctp_listen_startXin Long
If hashing fails in sctp_listen_start(), the socket remains in the LISTENING state, even though it was not added to the hash table. This can lead to a scenario where a socket appears to be listening without actually being accessible. This patch ensures that if the hashing operation fails, the sk_state is set back to CLOSED before returning an error. Note that there is no need to undo the autobind operation if hashing fails, as the bind port can still be used for next listen() call on the same socket. Fixes: 76c6d988aeb3 ("sctp: add sock_reuseport for the sock in __sctp_hash_endpoint") Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-10-09Merge branch 'net-improve-multicast-group-join-performance'David S. Miller
Jonas Rebmann says: ==================== improve multicast join group performance This series seeks to improve performance on updating igmp group memberships such as with IP_ADD_MEMBERSHIP or MCAST_JOIN_SOURCE_GROUP. Our use case was to add 2000 multicast memberships on a TQMLS1046A which took about 3.6 seconds for the membership additions alone. Our userspace reproducer tool was instrumented to log runtimes of the individual setsockopt invocations which clearly indicated quadratic complexity of setting up the membership with regard to the total number of multicast groups to be joined. We used perf to locate the hotspots and subsequently optimized the most costly sections of code. This series includes a patch to Linux igmp handling as well as a patch to the DPAA/Freescale driver. With both patches applied, our memberships can be set up in only about 87 miliseconds, which corresponds to a speedup of around 40. While we have acheived practically linear run-time complexity on the kernel side, a small quadratic factor remains in parts of the freescale driver code which we haven't yet optimized. We have by now payed little attention to the optimization potential in dropping group memberships, yet the dpaa patch applies to joining and leaving groups alike. Overall, this patch series brings great improvements in use cases involving large numbers of multicast groups, particularly when using the fsl_dpa driver, without noteworthy drawbacks in other scenarios. ==================== Signed-off-by: Jonas Rebmann <jre@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-10-09net: dpaa: use __dev_mc_sync in dpaa_set_rx_mode()Jonas Rebmann
The original driver first unregisters then re-registers all multicast addresses in the struct net_device_ops::ndo_set_rx_mode() callback. As the networking stack calls ndo_set_rx_mode() if a single multicast address change occurs, a significant amount of time may be used to first unregister and then re-register unchanged multicast addresses. This leads to performance issues when tracking large numbers of multicast addresses. Replace the unregister and register loop and the hand crafted mc_addr_list list handling with __dev_mc_sync(), to only update entries which have changed. On profiling with an fsl_dpa NIC, this patch presented a speedup of around 40 when successively setting up 2000 multicast groups using setsockopt(), without drawbacks on smaller numbers of multicast groups. Signed-off-by: Jonas Rebmann <jre@pengutronix.de> Reviewed-by: Sean Anderson <sean.anderson@seco.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-10-09net: ipv4: igmp: optimize ____ip_mc_inc_group() using mc_hashJonas Rebmann
The runtime cost of joining a single multicast group in the current implementation of ____ip_mc_inc_group grows linearly with the number of existing memberships. This is caused by the linear search for an existing group record in the multicast address list. This linear complexity results in quadratic complexity when successively adding memberships, which becomes a performance bottleneck when setting up large numbers of multicast memberships. If available, use the existing multicast hash map mc_hash to quickly search for an existing group membership record. This leads to near-constant complexity on the addition of a new multicast record, significantly improving performance for workloads involving many multicast memberships. On profiling with a loopback device, this patch presented a speedup of around 6 when successively setting up 2000 multicast groups using setsockopt without measurable drawbacks on smaller numbers of multicast groups. Signed-off-by: Jonas Rebmann <jre@pengutronix.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-10-09net: amd: mvme147: Fix probe banner messageDaniel Palmer
Currently this driver prints this line with what looks like a rogue format specifier when the device is probed: [ 2.840000] eth%d: MVME147 at 0xfffe1800, irq 12, Hardware Address xx:xx:xx:xx:xx:xx Change the printk() for netdev_info() and move it after the registration has completed so it prints out the name of the interface properly. Signed-off-by: Daniel Palmer <daniel@0x0f.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-10-09net: phy: realtek: Fix MMD access on RTL8126A-integrated PHYHeiner Kallweit
All MMD reads return 0 for the RTL8126A-integrated PHY. Therefore phylib assumes it doesn't support EEE, what results in higher power consumption, and a significantly higher chip temperature in my case. To fix this split out the PHY driver for the RTL8126A-integrated PHY and set the read_mmd/write_mmd callbacks to read from vendor-specific registers. Fixes: 5befa3728b85 ("net: phy: realtek: add support for RTL8126A-integrated 5Gbps PHY") Cc: stable@vger.kernel.org Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-10-09btrfs: fix clear_dirty and writeback ordering in submit_one_sector()Naohiro Aota
This commit is a replay of commit 6252690f7e1b ("btrfs: fix invalid mapping of extent xarray state"). We need to call btrfs_folio_clear_dirty() before btrfs_set_range_writeback(), so that xarray DIRTY tag is cleared. With a refactoring commit 8189197425e7 ("btrfs: refactor __extent_writepage_io() to do sector-by-sector submission"), it screwed up and the order is reversed and causing the same hang. Fix the ordering now in submit_one_sector(). Fixes: 8189197425e7 ("btrfs: refactor __extent_writepage_io() to do sector-by-sector submission") Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-09btrfs: zoned: fix missing RCU locking in error message when loading zone infoFilipe Manana
At btrfs_load_zone_info() we have an error path that is dereferencing the name of a device which is a RCU string but we are not holding a RCU read lock, which is incorrect. Fix this by using btrfs_err_in_rcu() instead of btrfs_err(). The problem is there since commit 08e11a3db098 ("btrfs: zoned: load zone's allocation offset"), back then at btrfs_load_block_group_zone_info() but then later on that code was factored out into the helper btrfs_load_zone_info() by commit 09a46725cc84 ("btrfs: zoned: factor out per-zone logic from btrfs_load_block_group_zone_info"). Fixes: 08e11a3db098 ("btrfs: zoned: load zone's allocation offset") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-09net: ti: icssg-prueth: Fix race condition for VLAN table accessMD Danish Anwar
The VLAN table is a shared memory between the two ports/slices in a ICSSG cluster and this may lead to race condition when the common code paths for both ports are executed in different CPUs. Fix the race condition access by locking the shared memory access Fixes: 487f7323f39a ("net: ti: icssg-prueth: Add helper functions to configure FDB") Signed-off-by: MD Danish Anwar <danishanwar@ti.com> Reviewed-by: Roger Quadros <rogerq@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-10-09ptp: Add support for the AMZNC10C 'vmclock' deviceDavid Woodhouse
The vmclock device addresses the problem of live migration with precision clocks. The tolerances of a hardware counter (e.g. TSC) are typically around ±50PPM. A guest will use NTP/PTP/PPS to discipline that counter against an external source of 'real' time, and track the precise frequency of the counter as it changes with environmental conditions. When a guest is live migrated, anything it knows about the frequency of the underlying counter becomes invalid. It may move from a host where the counter running at -50PPM of its nominal frequency, to a host where it runs at +50PPM. There will also be a step change in the value of the counter, as the correctness of its absolute value at migration is limited by the accuracy of the source and destination host's time synchronization. In its simplest form, the device merely advertises a 'disruption_marker' which indicates that the guest should throw away any NTP synchronization it thinks it has, and start again. Because the shared memory region can be exposed all the way to userspace through the /dev/vmclock0 node, applications can still use time from a fast vDSO 'system call', and check the disruption marker to be sure that their timestamp is indeed truthful. The structure also allows for the precise time, as known by the host, to be exposed directly to guests so that they don't have to wait for NTP to resync from scratch. The PTP driver consumes this information if present. Like the KVM PTP clock, this PTP driver can convert TSC-based cross timestamps into KVM clock values. Unlike the KVM PTP clock, it does so only when such is actually helpful. The values and fields are based on the nascent virtio-rtc specification, and the intent is that a version (hopefully precisely this version) of this structure will be included as an optional part of that spec. In the meantime, this driver supports the simple ACPI form of the device which is being shipped in certain commercial hypervisors (and submitted for inclusion in QEMU). Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-10-09Merge branch 'pcs-xpcs-cleanups-batch-2'David S. Miller
Russell King says: ==================== net: pcs: xpcs: cleanups batch 2 This is the second cleanup series for XPCS. Patch 1 removes the enum indexing the dw_xpcs_compat array. The index is never used except to place entries in the array and to size the array. Patch 2 removes the interface arrays - each of which only contain one interface. Patch 3 makes xpcs_find_compat() take the xpcs structure rather than the ID - the previous series removed the reason for xpcs_find_compat needing to take the ID. Patch 4 provides a helper to convert xpcs structure to a regular phylink_pcs structure, which leads to patch 5. Patch 5 moves the definition of struct dw_xpcs to the private xpcs header - with patch 4 in place, nothing outside of the xpcs driver accesses the contents of the dw_xpcs structure. Patch 6 renames xpcs_get_id() to xpcs_read_id() since it's reading the ID, rather than doing anything further with it. (Prior versions of this series renamed it to xpcs_read_phys_id() since that more accurately described that it was reading the physical ID registers.) Patch 7 moves the searching of the ID list out of line as this is a separate functional block. Patch 8 converts xpcs to use the bitmap macros, which eliminates the need for _SHIFT definitions. Patch 9 adds and uses _modify() accessors as there are a large amount of read-modify-write operations in this driver. This conversion found a bug in xpcs-wx code that has been reported and already fixed. Patch 10 converts xpcs to use read_poll_timeout() rather than open coding that. Patch 11 converts all printed messages to use the dev_*() functions so the driver and devie name are always printed. Patch 12 moves DW_VR_MII_DIG_CTRL1_2G5_EN to the correct place in the header file, rather than amongst another register's definitions. Patch 13 moves the Wangxun workaround to a common location rather than duplicating it in two places. We also reformat this to fit within 80 columns. ==================== Tested-by: Serge Semin <fancer.lancer@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-10-09net: pcs: xpcs: move Wangxun VR_XS_PCS_DIG_CTRL1 configurationRussell King (Oracle)
According to commits 2a22b7ae2fa3 ("net: pcs: xpcs: adapt Wangxun NICs for SGMII mode") and 2deea43f386d ("net: pcs: xpcs: add 1000BASE-X AN interrupt support"), Wangxun devices need special VR_XS_PCS_DIG_CTRL1 settings for SGMII and 1000BASE-X. Both SGMII and 1000BASE-X use the same settings. Rather than placing these in the individual xpcs_config_*() functions, move it to where we already test for the Wangxun devices in xpcs_do_config(). Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-10-09net: pcs: xpcs: correctly place DW_VR_MII_DIG_CTRL1_2G5_ENRussell King (Oracle)
Place DW_VR_MII_DIG_CTRL1_2G5_EN with the other DW_VR_MII_DIG_CTRL1 definitions rather than in the middle of a register list. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-10-09net: pcs: xpcs: use dev_*() to print messagesRussell King (Oracle)
Use the dev_*() family of functions to print all messages from the XPCS driver so we know which instance issues the messages. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-10-09net: pcs: xpcs: convert to use read_poll_timeout()Russell King (Oracle)
Convert the xpcs driver to use read_poll_timeout() when waiting for reset to complete, rather than open-coding this. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-10-09net: pcs: xpcs: add _modify() accessorsRussell King (Oracle)
The xpcs driver does a lot of read-modify-write operations on registers, which leads to long-winded code to read the register, check whether the read was successful, modify the value in some way, and then write it back. We have a mdiodev _modify() accessor that encapsulates this, and does the register modification under the MDIO bus lock ensuring that the modification is atomic with respect to other bus operations. Convert the xpcs driver to use this accessor. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-10-09net: pcs: xpcs: use FIELD_PREP() and FIELD_GET()Russell King (Oracle)
Convert xpcs to use the bitfield macros rather than definining the bitfield shifts and open-coding the insertion and extraction of these bitfields. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-10-09net: pcs: xpcs: move searching ID list out of lineRussell King (Oracle)
Move the searching of the physical ID out of xpcs_create() and into its own xpcs_identify() function, which makes it self contained. This reduces the complexity in xpcs_craete(), making it easier to follow, rather than having a lot of once-run code in the big for() loop. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-10-09net: pcs: xpcs: rename xpcs_get_id()Russell King (Oracle)
Rename xpcs_get_id() to xpcs_read_id() which more closely reflects the purpose of this function. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-10-09net: pcs: xpcs: move definition of struct dw_xpcs to private headerRussell King (Oracle)
There should be no reason for anything outside the XPCS code to know the contents of struct dw_xpcs - this is a private structure to XPCS. Move the definition to the private pcs-xpcs.h header, leaving a declaration in the global pcs/pcs-xpcs.h Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-10-09net: pcs: xpcs: provide a helper to get the phylink pcs given xpcsRussell King (Oracle)
Provide a helper to provide the pointer to the phylink_pcs struct given a valid xpcs pointer. This will be necessary when we make struct dw_xpcs private to pcs-xpcs.c Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-10-09net: pcs: xpcs: pass xpcs instead of xpcs->id to xpcs_find_compat()Russell King (Oracle)
xpcs_find_compat() is now always passed xpcs->id. Rather than always dereferencing this in the caller, move it into xpcs_find_compat(), thus making this function consistent with most of the other xpcs functions in taking an xpcs pointer. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-10-09net: pcs: xpcs: don't use array for interfaceRussell King (Oracle)
Currently, xpcs uses an array of interfaces that each "compat" entry supports. When looking up the compat entry for an interface, we iterate over the compat entries and then over each interface. Since each compat entry only has a single interface in its interfaces array, replace the array with a single member in the compat structure. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-10-09net: pcs: xpcs: remove dw_xpcs_compat enumRussell King (Oracle)
There is no reason for the struct dw_xpcs_compat arrays to be a fixed size other than the way we iterate over them. The index into the array isn't used for anything, and having them fixed size needlessly wastes space. Remove the enum that defines their size, and instead use an empty array entry (with NULL ->supported) to mark the end of the array. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-10-09xfs: fix a typoAndrew Kreimer
Fix a typo in comments. Signed-off-by: Andrew Kreimer <algonell@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-09xfs: don't free cowblocks from under dirty pagecache on unshareBrian Foster
fallocate unshare mode explicitly breaks extent sharing. When a command completes, it checks the data fork for any remaining shared extents to determine whether the reflink inode flag and COW fork preallocation can be removed. This logic doesn't consider in-core pagecache and I/O state, however, which means we can unsafely remove COW fork blocks that are still needed under certain conditions. For example, consider the following command sequence: xfs_io -fc "pwrite 0 1k" -c "reflink <file> 0 256k 1k" \ -c "pwrite 0 32k" -c "funshare 0 1k" <file> This allocates a data block at offset 0, shares it, and then overwrites it with a larger buffered write. The overwrite triggers COW fork preallocation, 32 blocks by default, which maps the entire 32k write to delalloc in the COW fork. All but the shared block at offset 0 remains hole mapped in the data fork. The unshare command redirties and flushes the folio at offset 0, removing the only shared extent from the inode. Since the inode no longer maps shared extents, unshare purges the COW fork before the remaining 28k may have written back. This leaves dirty pagecache backed by holes, which writeback quietly skips, thus leaving clean, non-zeroed pagecache over holes in the file. To verify, fiemap shows holes in the first 32k of the file and reads return different data across a remount: $ xfs_io -c "fiemap -v" <file> <file>: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS ... 1: [8..511]: hole 504 ... $ xfs_io -c "pread -v 4k 8" <file> 00001000: cd cd cd cd cd cd cd cd ........ $ umount <mnt>; mount <dev> <mnt> $ xfs_io -c "pread -v 4k 8" <file> 00001000: 00 00 00 00 00 00 00 00 ........ To avoid this problem, make unshare follow the same rules used for background cowblock scanning and never purge the COW fork for inodes with dirty pagecache or in-flight I/O. Fixes: 46afb0628b86347 ("xfs: only flush the unshared range in xfs_reflink_unshare") Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-08net: ibm: emac: mal: fix wrong gotoRosen Penev
dcr_map is called in the previous if and therefore needs to be unmapped. Fixes: 1ff0fcfcb1a6 ("ibm_newemac: Fix new MAL feature handling") Signed-off-by: Rosen Penev <rosenp@gmail.com> Link: https://patch.msgid.link/20241007235711.5714-1-rosenp@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-08net: phy: microchip_t1: SQI support for LAN887xTarun Alle
Add support for measuring Signal Quality Index for LAN887x T1 PHY. Signal Quality Index (SQI) is measure of Link Channel Quality from 0 to 7, with 7 as the best. By default, a link loss event shall indicate an SQI of 0. Signed-off-by: Tarun Alle <Tarun.Alle@microchip.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20241007063943.3233-1-tarun.alle@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>