2024-03-11  nexthop: Only parse NHA_OP_FLAGS for get messages that require it  (Ido Schimmel)
The attribute is parsed into 'op_flags' in nh_valid_get_del_req() which is called from the handlers of three message types: RTM_DELNEXTHOP, RTM_GETNEXTHOPBUCKET and RTM_GETNEXTHOP. The attribute is only used by the latter and rejected by the policies of the other two. Pass 'op_flags' as NULL from the handlers of the other two and only parse the attribute when the argument is not NULL. This is a preparation for a subsequent patch. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240311162307.545385-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
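Below is a minimal sketch of the resulting pattern, assuming the parsing happens roughly as the message describes (the function and attribute names are from the commit; the body is illustrative, not the exact kernel diff):

  /* Illustrative sketch: parse NHA_OP_FLAGS only when requested */
  static int nh_valid_get_del_req(const struct nlmsghdr *nlh,
                                  struct nlattr **tb, u32 *id,
                                  u32 *op_flags, /* NULL for RTM_DELNEXTHOP
                                                  * and RTM_GETNEXTHOPBUCKET */
                                  struct netlink_ext_ack *extack)
  {
          if (!tb[NHA_ID]) {
                  NL_SET_ERR_MSG(extack, "Nexthop id is missing");
                  return -EINVAL;
          }
          *id = nla_get_u32(tb[NHA_ID]);

          if (op_flags) {
                  if (tb[NHA_OP_FLAGS])
                          *op_flags = nla_get_u32(tb[NHA_OP_FLAGS]);
                  else
                          *op_flags = 0;
          }
          return 0;
  }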
2024-03-11  Merge tag 'x86_tdx_for_6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull x86 tdx update from Dave Hansen:

 - Fix sparse warning from TDX use of movdir64b()

* tag 'x86_tdx_for_6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/asm: Remove the __iomem annotation of movdir64b()'s dst argument
2024-03-11  Merge tag 'x86_mm_for_6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull x86 mm updates from Dave Hansen:

 - Add a warning when memory encryption conversions fail. These operations require VMM cooperation, even in CoCo environments where the VMM is untrusted. While it's _possible_ that memory pressure could trigger the new warning, the odds are that a guest would only see this from an attacking VMM.

 - Simplify page fault code by re-enabling interrupts unconditionally

 - Avoid truncation issues when pfns are passed in to pfn_to_kaddr() with small (<64-bit) types.

* tag 'x86_mm_for_6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm/cpa: Warn for set_memory_XXcrypted() VMM fails
  x86/mm: Get rid of conditional IF flag handling in page fault path
  x86/mm: Ensure input to pfn_to_kaddr() is treated as a 64-bit type
2024-03-11  Merge tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull core x86 updates from Ingo Molnar:

 - The biggest change is the rework of the percpu code, to support the 'Named Address Spaces' GCC feature, by Uros Bizjak:

      - This allows C code to access GS and FS segment relative memory via variables declared with such attributes, which allows the compiler to better optimize those accesses than the previous inline assembly code.

      - The series also includes a number of micro-optimizations for various percpu access methods, plus a number of cleanups of %gs accesses in assembly code.

      - These changes have been exposed to linux-next testing for the last ~5 months, with no known regressions in this area.

 - Fix/clean up __switch_to()'s broken but accidentally working handling of FPU switching - which also generates better code

 - Propagate more RIP-relative addressing in assembly code, to generate slightly better code

 - Rework the CPU mitigations Kconfig space to be less idiosyncratic, to make it easier for distros to follow & maintain these options

 - Rework the x86 idle code to cure RCU violations and to clean up the logic

 - Clean up the vDSO Makefile logic

 - Misc cleanups and fixes

* tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
  x86/idle: Select idle routine only once
  x86/idle: Let prefer_mwait_c1_over_halt() return bool
  x86/idle: Cleanup idle_setup()
  x86/idle: Clean up idle selection
  x86/idle: Sanitize X86_BUG_AMD_E400 handling
  sched/idle: Conditionally handle tick broadcast in default_idle_call()
  x86: Increase brk randomness entropy for 64-bit systems
  x86/vdso: Move vDSO to mmap region
  x86/vdso/kbuild: Group non-standard build attributes and primary object file rules together
  x86/vdso: Fix rethunk patching for vdso-image-{32,64}.o
  x86/retpoline: Ensure default return thunk isn't used at runtime
  x86/vdso: Use CONFIG_COMPAT_32 to specify vdso32
  x86/vdso: Use $(addprefix ) instead of $(foreach )
  x86/vdso: Simplify obj-y addition
  x86/vdso: Consolidate targets and clean-files
  x86/bugs: Rename CONFIG_RETHUNK => CONFIG_MITIGATION_RETHUNK
  x86/bugs: Rename CONFIG_CPU_SRSO => CONFIG_MITIGATION_SRSO
  x86/bugs: Rename CONFIG_CPU_IBRS_ENTRY => CONFIG_MITIGATION_IBRS_ENTRY
  x86/bugs: Rename CONFIG_CPU_UNRET_ENTRY => CONFIG_MITIGATION_UNRET_ENTRY
  x86/bugs: Rename CONFIG_SLS => CONFIG_MITIGATION_SLS
  ...
2024-03-11  Merge tag 'x86-cleanups-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull x86 cleanups from Ingo Molnar:
 "Misc cleanups, including a large series from Thomas Gleixner to cure sparse warnings"

* tag 'x86-cleanups-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/nmi: Drop unused declaration of proc_nmi_enabled()
  x86/callthunks: Use EXPORT_PER_CPU_SYMBOL_GPL() for per CPU variables
  x86/cpu: Provide a declaration for itlb_multihit_kvm_mitigation
  x86/cpu: Use EXPORT_PER_CPU_SYMBOL_GPL() for x86_spec_ctrl_current
  x86/uaccess: Add missing __force to casts in __access_ok() and valid_user_address()
  x86/percpu: Cure per CPU madness on UP
  smp: Consolidate smp_prepare_boot_cpu()
  x86/msr: Add missing __percpu annotations
  x86/msr: Prepare for including <linux/percpu.h> into <asm/msr.h>
  perf/x86/amd/uncore: Fix __percpu annotation
  x86/nmi: Remove an unnecessary IS_ENABLED(CONFIG_SMP)
  x86/apm_32: Remove dead function apm_get_battery_status()
  x86/insn-eval: Fix function param name in get_eff_addr_sib()
2024-03-11  Merge tag 'x86-build-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull x86 build updates from Ingo Molnar:

 - Reduce <asm/bootparam.h> dependencies

 - Simplify <asm/efi.h>

 - Unify *_setup_data definitions into <asm/setup_data.h>

 - Reduce the size of <asm/bootparam.h>

* tag 'x86-build-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Do not include <asm/bootparam.h> in several files
  x86/efi: Implement arch_ima_efi_boot_mode() in source file
  x86/setup: Move internal setup_data structures into setup_data.h
  x86/setup: Move UAPI setup structures into setup_data.h
2024-03-11  Merge tag 'x86-asm-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull x86 asm updates from Ingo Molnar:
 "Two changes to simplify the x86 decoder logic a bit"

* tag 'x86-asm-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/insn: Directly assign x86_64 state in insn_init()
  x86/insn: Remove superfluous checks from instruction decoding routines
2024-03-11  Merge tag 'sched-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull scheduler updates from Ingo Molnar:

 - Fix inconsistency in misfit task load-balancing

 - Fix CPU isolation bugs in the task-wakeup logic

 - Rework and unify the sched_use_asym_prio() and sched_asym_prefer() logic

 - Clean up and simplify ->avg_* accesses

 - Misc cleanups and fixes

* tag 'sched-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/topology: Rename SD_SHARE_PKG_RESOURCES to SD_SHARE_LLC
  sched/fair: Check the SD_ASYM_PACKING flag in sched_use_asym_prio()
  sched/fair: Rework sched_use_asym_prio() and sched_asym_prefer()
  sched/fair: Remove unused parameter from sched_asym()
  sched/topology: Remove duplicate descriptions from TOPOLOGY_SD_FLAGS
  sched/fair: Simplify the update_sd_pick_busiest() logic
  sched/fair: Do strict inequality check for busiest misfit task group
  sched/fair: Remove unnecessary goto in update_sd_lb_stats()
  sched/fair: Take the scheduling domain into account in select_idle_core()
  sched/fair: Take the scheduling domain into account in select_idle_smt()
  sched/fair: Add READ_ONCE() and use existing helper function to access ->avg_irq
  sched/fair: Use existing helper functions to access ->avg_rt and ->avg_dl
  sched/core: Simplify code by removing duplicate #ifdefs
2024-03-11  Merge tag 'locking-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull locking updates from Ingo Molnar:

 - Micro-optimize local_xchg() and the rtmutex code on x86

 - Fix percpu-rwsem contention tracepoints

 - Simplify debugging Kconfig dependencies

 - Update/clarify the documentation of atomic primitives

 - Misc cleanups

* tag 'locking-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/rtmutex: Use try_cmpxchg_relaxed() in mark_rt_mutex_waiters()
  locking/x86: Implement local_xchg() using CMPXCHG without the LOCK prefix
  locking/percpu-rwsem: Trigger contention tracepoints only if contended
  locking/rwsem: Make DEBUG_RWSEMS and PREEMPT_RT mutually exclusive
  locking/rwsem: Clarify that RWSEM_READER_OWNED is just a hint
  locking/mutex: Simplify <linux/mutex.h>
  locking/qspinlock: Fix 'wait_early' set but not used warning
  locking/atomic: scripts: Clarify ordering of conditional atomics
2024-03-11  Merge tag 'edac_updates_for_v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras  (Linus Torvalds)
Pull EDAC updates from Borislav Petkov:

 - Add a FRU (Field Replaceable Unit) memory poison manager which collects and manages previously encountered hw errors in order to save them to persistent storage across reboots. Previously recorded errors are "replayed" upon reboot in order to poison memory which has caused said errors in the past. The main use case is stacked, on-chip memory which cannot simply be replaced so poisoning faulty areas of it and thus making them inaccessible is the only strategy to prolong its lifetime.

 - Add an AMD address translation library glue which converts the reported addresses of hw errors into system physical addresses in order to be used by other subsystems like memory failure, for example. Add support for MI300 accelerators to that library.

 - igen6: Add support for Alder Lake-N SoC

 - i10nm: Add Grand Ridge support

 - The usual fixlets and cleanups

* tag 'edac_updates_for_v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  EDAC/versal: Convert to platform remove callback returning void
  RAS/AMD/FMPM: Fix off by one when unwinding on error
  RAS/AMD/FMPM: Add debugfs interface to print record entries
  RAS/AMD/FMPM: Save SPA values
  RAS: Export helper to get ras_debugfs_dir
  RAS/AMD/ATL: Fix bit overflow in denorm_addr_df4_np2()
  RAS: Introduce a FRU memory poison manager
  RAS/AMD/ATL: Add MI300 row retirement support
  Documentation: Move RAS section to admin-guide
  EDAC/versal: Make the bit position of injected errors configurable
  EDAC/i10nm: Add Intel Grand Ridge micro-server support
  EDAC/igen6: Add one more Intel Alder Lake-N SoC support
  RAS/AMD/ATL: Add MI300 DRAM to normalized address translation support
  RAS/AMD/ATL: Fix array overflow in get_logical_coh_st_fabric_id_mi300()
  RAS/AMD/ATL: Add MI300 support
  Documentation: RAS: Add index and address translation section
  EDAC/amd64: Use new AMD Address Translation Library
  RAS: Introduce AMD Address Translation Library
  EDAC/synopsys: Convert to devm_platform_ioremap_resource()
2024-03-11  Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  (Jakub Kicinski)
Alexei Starovoitov says:

====================
pull-request: bpf-next 2024-03-11

We've added 59 non-merge commits during the last 9 day(s) which contain a total of 88 files changed, 4181 insertions(+), 590 deletions(-).

The main changes are:

1) Enforce VM_IOREMAP flag and range in ioremap_page_range and introduce VM_SPARSE kind and vm_area_[un]map_pages to be used in bpf_arena, from Alexei.

2) Introduce bpf_arena which is sparse shared memory region between bpf program and user space where structures inside the arena can have pointers to other areas of the arena, and pointers work seamlessly for both user-space programs and bpf programs, from Alexei and Andrii.

3) Introduce may_goto instruction that is a contract between the verifier and the program. The verifier allows the program to loop assuming it's behaving well, but reserves the right to terminate it, from Alexei.

4) Use IETF format for field definitions in the BPF standard document, from Dave.

5) Extend struct_ops libbpf APIs to allow specifying version suffixes for struct_ops map types, share the same BPF program between several map definitions, and other improvements, from Eduard.

6) Enable struct_ops support for more than one page in trampolines, from Kui-Feng.

7) Support kCFI + BPF on riscv64, from Puranjay.

8) Use bpf_prog_pack for arm64 bpf trampoline, from Puranjay.

9) Fix roundup_pow_of_two undefined behavior on 32-bit archs, from Toke.
====================

Link: https://lore.kernel.org/r/20240312003646.8692-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11  Merge tag 'x86_misc_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull misc x86 fixes from Borislav Petkov:

 - Fix a wrong check in the function reporting whether a CPU executes (or not) an NMI handler

 - Rate limit unknown NMI messages in order to not potentially slow down the machine

 - Other fixlets

* tag 'x86_misc_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/nmi: Fix the inverse "in NMI handler" check
  Documentation/maintainer-tip: Add C++ tail comments exception
  Documentation/maintainer-tip: Add Closes tag
  x86/nmi: Rate limit unknown NMI messages
  Documentation/kernel-parameters: Add spec_rstack_overflow to mitigations=off
2024-03-11  Merge tag 'x86_sev_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull x86 SEV updates from Borislav Petkov:

 - Add the x86 part of the SEV-SNP host support.

   This will allow the kernel to be used as a KVM hypervisor capable of running SNP (Secure Nested Paging) guests. Roughly speaking, SEV-SNP is the ultimate goal of the AMD confidential computing side, providing the most comprehensive confidential computing environment to date.

   This is the x86 part, and there is a KVM part which did not get ready in time for the merge window, so the latter will be forthcoming in the next cycle.

 - Rework the early code's position-dependent SEV variable references in order to allow building the kernel with clang and -fPIE/-fPIC and -mcmodel=kernel

 - The usual set of fixes, cleanups and improvements all over the place

* tag 'x86_sev_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
  x86/sev: Disable KMSAN for memory encryption TUs
  x86/sev: Dump SEV_STATUS
  crypto: ccp - Have it depend on AMD_IOMMU
  iommu/amd: Fix failure return from snp_lookup_rmpentry()
  x86/sev: Fix position dependent variable references in startup code
  crypto: ccp: Make snp_range_list static
  x86/Kconfig: Remove CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT
  Documentation: virt: Fix up pre-formatted text block for SEV ioctls
  crypto: ccp: Add the SNP_SET_CONFIG command
  crypto: ccp: Add the SNP_COMMIT command
  crypto: ccp: Add the SNP_PLATFORM_STATUS command
  x86/cpufeatures: Enable/unmask SEV-SNP CPU feature
  KVM: SEV: Make AVIC backing, VMSA and VMCB memory allocation SNP safe
  crypto: ccp: Add panic notifier for SEV/SNP firmware shutdown on kdump
  iommu/amd: Clean up RMP entries for IOMMU pages during SNP shutdown
  crypto: ccp: Handle legacy SEV commands when SNP is enabled
  crypto: ccp: Handle non-volatile INIT_EX data when SNP is enabled
  crypto: ccp: Handle the legacy TMR allocation when SNP is enabled
  x86/sev: Introduce an SNP leaked pages list
  crypto: ccp: Provide an API to issue SEV and SNP commands
  ...
2024-03-11  Merge tag 'x86_cache_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull resource control updates from Borislav Petkov:

 - Rework different aspects of the resctrl code, like adding arch-specific accessors and splitting the locking, in order to accommodate ARM's MPAM implementation of hw resource control and be able to use the same filesystem control interface as on x86. Work by James Morse

 - Improve the memory bandwidth throttling heuristic to handle workloads with irregular load levels which end up penalized unnecessarily

 - Use CPUID to detect the memory bandwidth enforcement limit on AMD

 - The usual set of fixes

* tag 'x86_cache_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
  x86/resctrl: Remove lockdep annotation that triggers false positive
  x86/resctrl: Separate arch and fs resctrl locks
  x86/resctrl: Move domain helper migration into resctrl_offline_cpu()
  x86/resctrl: Add CPU offline callback for resctrl work
  x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but CPU
  x86/resctrl: Add CPU online callback for resctrl work
  x86/resctrl: Add helpers for system wide mon/alloc capable
  x86/resctrl: Make rdt_enable_key the arch's decision to switch
  x86/resctrl: Move alloc/mon static keys into helpers
  x86/resctrl: Make resctrl_mounted checks explicit
  x86/resctrl: Allow arch to allocate memory needed in resctrl_arch_rmid_read()
  x86/resctrl: Allow resctrl_arch_rmid_read() to sleep
  x86/resctrl: Queue mon_event_read() instead of sending an IPI
  x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow
  x86/resctrl: Move CLOSID/RMID matching and setting to use helpers
  x86/resctrl: Allocate the cleanest CLOSID by searching closid_num_dirty_rmid
  x86/resctrl: Use __set_bit()/__clear_bit() instead of open coding
  x86/resctrl: Track the number of dirty RMID a CLOSID has
  x86/resctrl: Allow RMID allocation to be scoped by CLOSID
  x86/resctrl: Access per-rmid structures by index
  ...
2024-03-11  Merge tag 'x86_mtrr_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull x86 MTRR update from Borislav Petkov:

 - Relax the PAT MSR programming which was unnecessarily using the MTRR programming protocol of disabling the cache around the changes. The reason behind this is the current algorithm triggering a #VE exception for TDX guests and unnecessarily complicating things

* tag 'x86_mtrr_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/pat: Simplify the PAT programming protocol
2024-03-11  Merge tag 'x86_cpu_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull x86 cpu update from Borislav Petkov:

 - Have AMD Zen common init code run on all families from Zen1 onwards in order to save some future enablement effort

* tag 'x86_cpu_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/CPU/AMD: Do the common init on future Zens too
2024-03-11  Merge tag 'ras_core_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull RAS fixlet from Borislav Petkov:

 - Constify yet another static struct bus_type instance now that the driver core can handle that

* tag 'ras_core_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mce: Make mce_subsys const
2024-03-11Revert "dm: use queue_limits_set"Linus Torvalds
This reverts commit 8e0ef412869430d114158fc3b9b1fb111e247bd3. It's broken, and causes the boot to fail on encrypted volumes. Reported-and-bisected-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/all/20240311235023.GA1205@cmpxchg.org/ Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-03-11  bpf: move sleepable flag from bpf_prog_aux to bpf_prog  (Andrii Nakryiko)
prog->aux->sleepable is checked very frequently as part of (some) BPF program run hot paths. So this extra aux indirection seems wasteful and on busy systems might cause unnecessary memory cache misses. Let's move sleepable flag into prog itself to eliminate unnecessary pointer dereference. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Message-ID: <20240309004739.2961431-1-andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-11  bpf: hardcode BPF_PROG_PACK_SIZE to 2MB * num_possible_nodes()  (Puranjay Mohan)
On some architectures like ARM64, PMD_SIZE can be really large in some configurations: with CONFIG_ARM64_64K_PAGES=y, PMD_SIZE is 512MB. Use 2MB * num_possible_nodes() as the size for allocations done through the prog pack allocator. On most architectures, PMD_SIZE will be equal to 2MB in case of 4KB pages and will be greater than 2MB for bigger page sizes.

Fixes: ea2babac63d4 ("bpf: Simplify bpf_prog_pack_[size|mask]")
Reported-by: "kernelci.org bot" <bot@kernelci.org>
Closes: https://lore.kernel.org/all/7e216c88-77ee-47b8-becc-a0f780868d3c@sirena.org.uk/
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202403092219.dhgcuz2G-lkp@intel.com/
Suggested-by: Song Liu <song@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Message-ID: <20240311122722.86232-1-puranjay12@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
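The resulting definition is presumably along these lines (a sketch based on the subject line; SZ_2M and num_possible_nodes() are existing kernel helpers):

  #include <linux/sizes.h>
  #include <linux/nodemask.h>

  /* Was PMD_SIZE, which can be 512MB with CONFIG_ARM64_64K_PAGES=y.
   * 2MB per possible NUMA node keeps pack allocations bounded.
   */
  #define BPF_PROG_PACK_SIZE      (SZ_2M * num_possible_nodes())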
2024-03-11  Merge tag 'x86-entry-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull x86 entry update from Thomas Gleixner:
 "A single update for the x86 entry code:

  The current CR3 handling for kernel page table isolation in the paranoid return paths which are relevant for #NMI, #MCE, #VC, #DB and #DF is unconditionally writing CR3 with the value retrieved on exception entry.

  In the vast majority of cases when returning to the kernel this is a pointless exercise because CR3 was not modified on exception entry. The only situation where this is necessary is when the exception interrupts an entry from user before switching to kernel CR3 or interrupts an exit to user after switching back to user CR3.

  As CR3 writes can be expensive on some systems this becomes measurable overhead with high frequency #NMIs such as perf.

  Avoid this overhead by checking the CR3 value, which was saved on entry, and write it back to CR3 only when it is a user CR3"

* tag 'x86-entry-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/entry: Avoid redundant CR3 write on paranoid returns
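Conceptually, the check described above looks like this in C pseudocode (the real code lives in the asm entry paths; using PTI_USER_PGTABLE_MASK here is an assumption that the saved CR3 image carries the PTI user bit):

  /* saved_cr3 was captured on exception entry */
  if (saved_cr3 & PTI_USER_PGTABLE_MASK) {
          /* exception interrupted an entry/exit with user CR3: restore it */
          native_write_cr3(saved_cr3);
  }
  /* else: kernel CR3 was active all along, skip the costly write */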
2024-03-11  selftests/bpf: Add kprobe multi triggering benchmarks  (Jiri Olsa)
Adding kprobe multi triggering benchmarks. It's useful now to bench the new fprobe implementation and might be useful later as well.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240311211023.590321-1-jolsa@kernel.org
2024-03-11  ptp: Move from simple ida to xarray  (Kory Maincent)
Move from simple ida to xarray for storing and loading the ptp_clock pointer. This prepares support for future hardware timestamp selection by being able to link the ptp clock index to its pointer. Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://lore.kernel.org/r/20240311144730.1239594-1-kory.maincent@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
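The xarray side of such an ida-to-xarray conversion typically looks like this (a hedged sketch, not the driver's exact code; the xarray and function names are made up):

  #include <linux/xarray.h>

  static DEFINE_XARRAY_ALLOC(ptp_clocks_map);     /* illustrative name */

  /* allocate a clock index and store the pointer in one step */
  static int ptp_register_index(struct ptp_clock *ptp, u32 *index)
  {
          return xa_alloc(&ptp_clocks_map, index, ptp,
                          xa_limit_31b, GFP_KERNEL);
  }

  /* the win over an ida: the index maps back to its ptp_clock */
  static struct ptp_clock *ptp_clock_by_index(u32 index)
  {
          return xa_load(&ptp_clocks_map, index);
  }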
2024-03-11  vxlan: Remove generic .ndo_get_stats64  (Breno Leitao)
Commit 3e2f544dd8a33 ("net: get stats64 if device if driver is configured") moved the dev_get_tstats64() callback into the net core, so, unless the driver is doing some custom stats collection, it does not need to set .ndo_get_stats64.

Since this driver now relies on NETDEV_PCPU_STAT_TSTATS, it doesn't need to set the generic dev_get_tstats64() as its .ndo_get_stats64 function pointer.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com>
Link: https://lore.kernel.org/r/20240311112437.3813987-2-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11  vxlan: Do not alloc tstats manually  (Breno Leitao)
With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and convert veth & vrf"), stats allocation could be done on net core instead of in this driver. With this new approach, the driver doesn't have to bother with error handling (allocation failure checking, making sure free happens in the right spot, etc). This is core responsibility now. Remove the allocation in the vxlan driver and leverage the network core allocation instead. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com> Link: https://lore.kernel.org/r/20240311112437.3813987-1-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11  Merge tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull x86 FRED support from Thomas Gleixner:
 "Support for x86 Fast Return and Event Delivery (FRED).

  FRED is a replacement for IDT event delivery on x86 and addresses most of the technical nightmares which IDT exposes:

   1) Exception cause registers like CR2 need to be manually preserved in nested exception scenarios.

   2) Hardware interrupt stack switching is suboptimal for nested exceptions as the interrupt stack mechanism rewinds the stack on each entry which requires a massive effort in the low level entry of #NMI code to handle this.

   3) No hardware distinction between entry from kernel or from user which makes establishing kernel context more complex than it needs to be especially for unconditionally nestable exceptions like NMI.

   4) NMI nesting caused by IRET unconditionally reenabling NMIs, which is a problem when the perf NMI takes a fault when collecting a stack trace.

   5) Partial restore of ESP when returning to a 16-bit segment

   6) Limitation of the vector space which can cause vector exhaustion on large systems.

   7) Inability to differentiate NMI sources

  FRED addresses these shortcomings by:

   1) An extended exception stack frame which the CPU uses to save exception cause registers. This ensures that the meta information for each exception is preserved on stack and avoids the extra complexity of preserving it in software.

   2) Hardware interrupt stack switching is non-rewinding if a nested exception uses the current interrupt stack.

   3) The entry points for kernel and user context are separate and GS BASE handling which is required to establish kernel context for per CPU variable access is done in hardware.

   4) NMIs are now nesting protected. They are only reenabled on the return from NMI.

   5) FRED guarantees full restore of ESP

   6) FRED does not put a limitation on the vector space by design because it uses central entry points for kernel and user space and the CPU stores the entry type (exception, trap, interrupt, syscall) on the entry stack along with the vector number. The entry code has to demultiplex this information, but this removes the vector space restriction.

      The first hardware implementations will still have the current restricted vector space because lifting this limitation requires further changes to the local APIC.

   7) FRED stores the vector number and meta information on stack which allows having more than one NMI vector in future hardware when the required local APIC changes are in place.

  The series implements the initial FRED support by:

   - Reworking the existing entry and IDT handling infrastructure to accommodate the alternative entry mechanism.

   - Expanding the stack frame to accommodate the extra 16 bytes FRED requires to store context and meta information

   - Providing FRED specific C entry points for events which have information pushed to the extended stack frame, e.g. #PF and #DB.

   - Providing FRED specific C entry points for #NMI and #MCE

   - Implementing the FRED specific ASM entry points and the C code to demultiplex the events

   - Providing detection and initialization mechanisms and the necessary tweaks in context switching, GS BASE handling etc.

  The FRED integration aims for maximum code reuse vs the existing IDT implementation to the extent possible, and the deviations in hot paths like context switching are handled with alternatives to minimize the impact.
  The low level entry and exit paths are separate due to the extended stack frame and the hardware based GS BASE switching and therefore have no impact on IDT based systems.

  It has been extensively tested on existing systems and on the FRED simulation and as of now there are no outstanding problems"

* tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
  x86/fred: Fix init_task thread stack pointer initialization
  MAINTAINERS: Add a maintainer entry for FRED
  x86/fred: Fix a build warning with allmodconfig due to 'inline' failing to inline properly
  x86/fred: Invoke FRED initialization code to enable FRED
  x86/fred: Add FRED initialization functions
  x86/syscall: Split IDT syscall setup code into idt_syscall_init()
  KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling
  x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI
  x86/entry/calling: Allow PUSH_AND_CLEAR_REGS being used beyond actual entry code
  x86/fred: Fixup fault on ERETU by jumping to fred_entrypoint_user
  x86/fred: Let ret_from_fork_asm() jmp to asm_fred_exit_user when FRED is enabled
  x86/traps: Add sysvec_install() to install a system interrupt handler
  x86/fred: FRED entry/exit and dispatch code
  x86/fred: Add a machine check entry stub for FRED
  x86/fred: Add a NMI entry stub for FRED
  x86/fred: Add a debug fault entry stub for FRED
  x86/idtentry: Incorporate definitions/declarations of the FRED entries
  x86/fred: Make exc_page_fault() work for FRED
  x86/fred: Allow single-step trap and NMI when starting a new task
  x86/fred: No ESPFIX needed when FRED is enabled
  ...
2024-03-11  devlink: Add comments to use netlink gen tool  (William Tu)
Add a comment to remind people not to manually modify net/devlink/netlink_gen.c, but to use tools/net/ynl/ynl-regen.sh to generate it.

Signed-off-by: William Tu <witu@nvidia.com>
Suggested-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240310145503.32721-1-witu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11  nfp: flower: handle acti_netdevs allocation failure  (Duoming Zhou)
The kmalloc_array() in nfp_fl_lag_do_work() will return NULL if physical memory has run out. As a result, dereferencing acti_netdevs would trigger a NULL pointer dereference.

This patch adds a check for allocation failure; if the allocation fails, the delayed work is rescheduled to try again.

Fixes: bb9a8d031140 ("nfp: flower: monitor and offload LAG groups")
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Reviewed-by: Louis Peens <louis.peens@corigine.com>
Link: https://lore.kernel.org/r/20240308142540.9674-1-duoming@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
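The shape of the fix is roughly the following (a sketch; the lag->work field and the NFP_FL_LAG_DELAY requeue delay are assumptions about the driver, not verified names):

  acti_netdevs = kmalloc_array(entry->slave_cnt,
                               sizeof(*acti_netdevs), GFP_KERNEL);
  if (!acti_netdevs) {
          /* try again later rather than dereferencing NULL */
          schedule_delayed_work(&lag->work, NFP_FL_LAG_DELAY);
          continue;
  }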
2024-03-11  net/packet: Add getsockopt support for PACKET_COPY_THRESH  (Juntong Deng)
Currently getsockopt does not support PACKET_COPY_THRESH, and we are unable to get the value of PACKET_COPY_THRESH socket option through getsockopt. This patch adds getsockopt support for PACKET_COPY_THRESH. In addition, this patch converts access to copy_thresh to READ_ONCE/WRITE_ONCE. Signed-off-by: Juntong Deng <juntong.deng@outlook.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://lore.kernel.org/r/AM6PR03MB58487A9704FD150CF76F542899272@AM6PR03MB5848.eurprd03.prod.outlook.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
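From user space, the new option reads back like any other packet(7) socket option; a minimal sketch:

  #include <stdio.h>
  #include <sys/socket.h>
  #include <linux/if_packet.h>

  static int read_copy_thresh(int fd)
  {
          int val = 0;
          socklen_t len = sizeof(val);

          if (getsockopt(fd, SOL_PACKET, PACKET_COPY_THRESH, &val, &len))
                  return -1;
          printf("PACKET_COPY_THRESH = %d\n", val);
          return 0;
  }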
2024-03-11  net/netlink: Add getsockopt support for NETLINK_LISTEN_ALL_NSID  (Juntong Deng)
Currently getsockopt does not support NETLINK_LISTEN_ALL_NSID, and we are unable to get the value of NETLINK_LISTEN_ALL_NSID socket option through getsockopt. This patch adds getsockopt support for NETLINK_LISTEN_ALL_NSID. Signed-off-by: Juntong Deng <juntong.deng@outlook.com> Link: https://lore.kernel.org/r/AM6PR03MB58482322B7B335308DA56FE599272@AM6PR03MB5848.eurprd03.prod.outlook.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11  Merge tag 'x86-apic-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull x86 APIC updates from Thomas Gleixner:
 "Rework of APIC enumeration and topology evaluation.

  The current implementation has a couple of shortcomings:

   - It fails to handle hybrid systems correctly.

   - The APIC registration code which handles CPU number assignments is in the middle of the APIC code and detached from the topology evaluation.

   - The various mechanisms which enumerate APICs, ACPI, MPPARSE and guest specific ones, tweak global variables as they see fit or in case of XENPV just hack around the generic mechanisms completely.

   - The CPUID topology evaluation code is sprinkled all over the vendor code and reevaluates global variables on every hotplug operation.

   - There is no way to analyze topology on the boot CPU before bringing up the APs. This causes problems for infrastructure like PERF which needs to size certain aspects upfront or could be simplified if that would be possible.

   - The APIC admission and CPU number association logic is incomprehensible and overly complex and needs to be kept around after boot instead of completing this right after the APIC enumeration.

  This update addresses these shortcomings with the following changes:

   - Rework the CPUID evaluation code so it is common for all vendors and provides information about the APIC ID segments in a uniform way independent of the number of segments (Thread, Core, Module, ..., Die, Package) so that this information can be computed instead of rewriting global variables of dubious value over and over.

   - A few cleanups and simplifications of the APIC, IO/APIC and related interfaces to prepare for the topology evaluation changes.

   - Separation of the parser stages so the early evaluation which tries to find the APIC address can be separately overridden from the late evaluation which enumerates and registers the local APIC as further preparation for sanitizing the topology evaluation.

   - A new registration and admission logic which

       - encapsulates the inner workings so that parsers and guest logic can no longer fiddle in it

       - uses the APIC ID segments to build topology bitmaps at registration time

       - provides a sane admission logic

       - allows to detect the crash kernel case, where CPU0 does not run on the real BSP, automatically. This is required to prevent sending INIT/SIPI sequences to the real BSP which would reset the whole machine. This was so far handled by a tedious command line parameter, which does not even work in nested crash scenarios.

       - Associates CPU number after the enumeration completed and prevents the late registration of APICs, which was somehow tolerated before.

   - Converting all parsers and guest enumeration mechanisms over to the new interfaces. This allows to get rid of all global variable tweaking from the parsers and enumeration mechanisms and sanitizes the XEN[PV] handling so it can use CPUID evaluation for the first time.

   - Mopping up existing sins by taking the information from the APIC ID segment bitmaps. This evaluates hybrid systems correctly on the boot CPU and allows for cleanups and fixes in the related drivers, e.g. PERF.
  The series has been extensively tested and the minimal late fallout due to a broken ACPI/MADT table has been addressed by tightening the admission logic further"

* tag 'x86-apic-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (76 commits)
  x86/topology: Ignore non-present APIC IDs in a present package
  x86/apic: Build the x86 topology enumeration functions on UP APIC builds too
  smp: Provide 'setup_max_cpus' definition on UP too
  smp: Avoid 'setup_max_cpus' namespace collision/shadowing
  x86/bugs: Use fixed addressing for VERW operand
  x86/cpu/topology: Get rid of cpuinfo::x86_max_cores
  x86/cpu/topology: Provide __num_[cores|threads]_per_package
  x86/cpu/topology: Rename topology_max_die_per_package()
  x86/cpu/topology: Rename smp_num_siblings
  x86/cpu/topology: Retrieve cores per package from topology bitmaps
  x86/cpu/topology: Use topology logical mapping mechanism
  x86/cpu/topology: Provide logical pkg/die mapping
  x86/cpu/topology: Simplify cpu_mark_primary_thread()
  x86/cpu/topology: Mop up primary thread mask handling
  x86/cpu/topology: Use topology bitmaps for sizing
  x86/cpu/topology: Let XEN/PV use topology from CPUID/MADT
  x86/xen/smp_pv: Count number of vCPUs early
  x86/cpu/topology: Assign hotpluggable CPUIDs during init
  x86/cpu/topology: Reject unknown APIC IDs on ACPI hotplug
  x86/topology: Add a mechanism to track topology via APIC IDs
  ...
2024-03-11  Merge branch 'bpf-introduce-bpf-arena'  (Andrii Nakryiko)
Alexei Starovoitov says:

====================
bpf: Introduce BPF arena.

From: Alexei Starovoitov <ast@kernel.org>

v2->v3:
- contains bpf bits only, but cc-ing past audience for continuity
- since prerequisite patches landed, this series focuses on the main functionality of bpf_arena.
- adopted Andrii's approach to support arena in libbpf.
- simplified LLVM support. Instead of two instructions it's now only one.
- switched to cond_break (instead of open coded iters) in selftests
- implemented several follow-ups that will be sent after this set
  . remember first IP and bpf insn that faulted in arena. report to user space via bpftool
  . copy paste and tweak glob_match() aka mini-regex as a selftests/bpf
- see patch 1 for detailed description of bpf_arena

v1->v2:
- Improved commit log with reasons for using vmap_pages_range() in arena. Thanks to Johannes
- Added support for __arena global variables in bpf programs
- Fixed race conditions spotted by Barret
- Fixed wrap32 issue spotted by Barret
- Fixed bpf_map_mmap_sz() the way Andrii suggested

The work on bpf_arena was inspired by Barret's work:
https://github.com/google/ghost-userspace/blob/main/lib/queue.bpf.h
that implements queues, lists and AVL trees completely as bpf programs using giant bpf array map and integer indices instead of pointers.

bpf_arena is a sparse array that allows to use normal C pointers to build such data structures. Last few patches implement page_frag allocator, link list and hash table as bpf programs.

v1:
bpf programs have multiple options to communicate with user space:
- Various ring buffers (perf, ftrace, bpf): The data is streamed unidirectionally from bpf to user space.
- Hash map: The bpf program populates elements, and user space consumes them via bpf syscall.
- mmap()-ed array map: Libbpf creates an array map that is directly accessed by the bpf program and mmap-ed to user space. It's the fastest way. Its disadvantage is that memory for the whole array is reserved at the start.
====================

Link: https://lore.kernel.org/r/20240308010812.89848-1-alexei.starovoitov@gmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-03-11  selftests/bpf: Add bpf_arena_htab test.  (Alexei Starovoitov)
bpf_arena_htab.h - hash table implemented as bpf program Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240308010812.89848-15-alexei.starovoitov@gmail.com
2024-03-11  selftests/bpf: Add bpf_arena_list test.  (Alexei Starovoitov)
bpf_arena_alloc.h - implements page_frag allocator as a bpf program.
bpf_arena_list.h - doubly linked list as a bpf program.

Compiled as a bpf program and as native C code.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240308010812.89848-14-alexei.starovoitov@gmail.com
2024-03-11  selftests/bpf: Add unit tests for bpf_arena_alloc/free_pages  (Alexei Starovoitov)
Add unit tests for bpf_arena_alloc/free_pages() functionality and bpf_arena_common.h with a set of common helpers and macros that is used in this test and the following patches.

Also modify test_loader, which didn't support running bpf_prog_type_syscall programs.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240308010812.89848-13-alexei.starovoitov@gmail.com
2024-03-11  bpf: Add helper macro bpf_addr_space_cast()  (Alexei Starovoitov)
Introduce helper macro bpf_addr_space_cast() that emits an rX = rX instruction with off = BPF_ADDR_SPACE_CAST and encodes the dest and src address spaces into imm32. It's useful with older LLVM that doesn't emit this insn automatically.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20240308010812.89848-12-alexei.starovoitov@gmail.com
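Usage in a bpf program looks roughly like this (a sketch following the selftests' conventions; the __arena attribute spelling and the variable names are assumptions):

  /* assumed: #define __arena __attribute__((address_space(1))) */
  void __arena *shared;           /* arena pointer, address space 1 */

  static int peek(void)
  {
          void __arena *p = shared;

          /* cast arena view (as 1) to kernel view (as 0); a no-op
           * on LLVM that emits the insn by itself */
          bpf_addr_space_cast(p, 0, 1);
          return p != NULL;
  }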
2024-03-11  libbpf: Recognize __arena global variables.  (Andrii Nakryiko)
LLVM automatically places __arena variables into ".arena.1" ELF section. In order to use such global variables bpf program must include definition of arena map in ".maps" section, like:

  struct {
          __uint(type, BPF_MAP_TYPE_ARENA);
          __uint(map_flags, BPF_F_MMAPABLE);
          __uint(max_entries, 1000);        /* number of pages */
          __ulong(map_extra, 2ull << 44);   /* start of mmap() region */
  } arena SEC(".maps");

libbpf recognizes both uses of arena and creates single `struct bpf_map *` instance in libbpf APIs. ".arena.1" ELF section data is used as initial data image, which is exposed through skeleton and bpf_map__initial_value() to the user, if they need to tune it before the load phase. During load phase, this initial image is copied over into mmap()'ed region corresponding to arena, and discarded.

A few small checks here and there had to be added to make sure this approach works with bpf_map__initial_value(), mostly due to hard-coded assumption that map->mmaped is set up with mmap() syscall and should be munmap()'ed. For arena, .arena.1 can be (much) smaller than maximum arena size, so this smaller data size has to be tracked separately. Given it is enforced that there is only one arena for the entire bpf_object instance, we just keep it in a separate field. This can be generalized if necessary later.

All global variables from ".arena.1" section are accessible from user space via skel->arena->name_of_var.

For bss/data/rodata the skeleton/libbpf perform the following sequence:
  1. addr = mmap(MAP_ANONYMOUS)
  2. user space optionally modifies global vars
  3. map_fd = bpf_create_map()
  4. bpf_update_map_elem(map_fd, addr) // to store values into the kernel
  5. mmap(addr, MAP_FIXED, map_fd)
after step 5 user space sees the values it wrote at step 2 at the same addresses.

The arena doesn't support update_map_elem. Hence skeleton/libbpf do:
  1. addr = malloc(sizeof SEC ".arena.1")
  2. user space optionally modifies global vars
  3. map_fd = bpf_create_map(MAP_TYPE_ARENA)
  4. real_addr = mmap(map->map_extra, MAP_SHARED | MAP_FIXED, map_fd)
  5. memcpy(real_addr, addr) // this will fault-in and allocate pages

At the end, the look and feel of global data vs __arena global data is the same from bpf prog pov.

Another complication is:

  struct {
          __uint(type, BPF_MAP_TYPE_ARENA);
  } arena SEC(".maps");

  int __arena foo;
  int bar;

  ptr1 = &foo;   // relocation against ".arena.1" section
  ptr2 = &arena; // relocation against ".maps" section
  ptr3 = &bar;   // relocation against ".bss" section

For the kernel, ptr1 and ptr2 point to the same arena's map_fd while ptr3 points to a different global array's map_fd. For the verifier:

  ptr1->type == unknown_scalar
  ptr2->type == const_ptr_to_map
  ptr3->type == ptr_to_map_value

After verification, from JIT pov all 3 ptr-s are normal ld_imm64 insns.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20240308010812.89848-11-alexei.starovoitov@gmail.com
2024-03-11  bpftool: Recognize arena map type  (Alexei Starovoitov)
Teach bpftool to recognize arena map type. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20240308010812.89848-10-alexei.starovoitov@gmail.com
2024-03-11  libbpf: Add support for bpf_arena.  (Alexei Starovoitov)
mmap() bpf_arena right after creation, since the kernel needs to remember the address returned from mmap. This is user_vm_start. LLVM will generate bpf_arena_cast_user() instructions where necessary and JIT will add upper 32-bit of user_vm_start to such pointers. Fix up bpf_map_mmap_sz() to compute mmap size as map->value_size * map->max_entries for arrays and PAGE_SIZE * map->max_entries for arena. Don't set BTF at arena creation time, since it doesn't support it. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240308010812.89848-9-alexei.starovoitov@gmail.com
2024-03-11  libbpf: Add __arg_arena to bpf_helpers.h  (Alexei Starovoitov)
Add __arg_arena to bpf_helpers.h Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240308010812.89848-8-alexei.starovoitov@gmail.com
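The definition is presumably a thin wrapper around the decl tag that the verifier learns to recognize in the next patch; a sketch (the exact attribute spelling in bpf_helpers.h is not verified here):

  /* in bpf_helpers.h (approximate) */
  #define __arg_arena __attribute__((btf_decl_tag("arg:arena")))

  /* usage: a global subprog taking an arena pointer */
  __weak int process_page(void *ptr __arg_arena)
  {
          return ptr != NULL;
  }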
2024-03-11  bpf: Recognize btf_decl_tag("arg: Arena") as PTR_TO_ARENA.  (Alexei Starovoitov)
In global bpf functions recognize btf_decl_tag("arg:arena") as PTR_TO_ARENA. Note, when the verifier sees: __weak void foo(struct bar *p) it recognizes 'p' as PTR_TO_MEM and 'struct bar' has to be a struct with scalars. Hence the only way to use arena pointers in global functions is to tag them with "arg:arena". Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/bpf/20240308010812.89848-7-alexei.starovoitov@gmail.com
2024-03-11  bpf: Recognize addr_space_cast instruction in the verifier.  (Alexei Starovoitov)
rY = addr_space_cast(rX, 0, 1) tells the verifier that rY->type = PTR_TO_ARENA. Any further operations on PTR_TO_ARENA register have to be in 32-bit domain. The verifier will mark load/store through PTR_TO_ARENA with PROBE_MEM32. JIT will generate them as kern_vm_start + 32bit_addr memory accesses.

rY = addr_space_cast(rX, 1, 0) tells the verifier that rY->type = unknown scalar. If arena->map_flags has BPF_F_NO_USER_CONV set then convert cast_user to mov32 as well. Otherwise JIT will convert it to:

  rY = (u32)rX;
  if (rY)
    rY |= arena->user_vm_start & ~(u64)~0U;

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240308010812.89848-6-alexei.starovoitov@gmail.com
2024-03-11  bpf: Add x86-64 JIT support for bpf_addr_space_cast instruction.  (Alexei Starovoitov)
LLVM generates bpf_addr_space_cast instruction while translating pointers between native (zero) address space and __attribute__((address_space(N))). The addr_space=1 is reserved as bpf_arena address space.

rY = addr_space_cast(rX, 0, 1) is processed by the verifier and converted to normal 32-bit move: wX = wY

rY = addr_space_cast(rX, 1, 0) has to be converted by JIT:

  aux_reg = upper_32_bits of arena->user_vm_start
  aux_reg <<= 32
  wX = wY                   // clear upper 32 bits of dst register
  if (wX)                   // if not zero add upper bits of user_vm_start
    wX |= aux_reg

JIT can do it more efficiently:

  mov dst_reg32, src_reg32  // 32-bit move
  shl dst_reg, 32
  or dst_reg, user_vm_start
  rol dst_reg, 32
  xor r11, r11
  test dst_reg32, dst_reg32 // check if lower 32-bit are zero
  cmove r11, dst_reg        // if so, set dst_reg to zero
                            // Intel swapped src/dst register encoding in CMOVcc

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240308010812.89848-5-alexei.starovoitov@gmail.com
2024-03-11  bpf: Add x86-64 JIT support for PROBE_MEM32 pseudo instructions.  (Alexei Starovoitov)
Add support for [LDX | STX | ST], PROBE_MEM32, [B | H | W | DW] instructions. They are similar to PROBE_MEM instructions with the following differences:

- PROBE_MEM has to check that the address is in the kernel range with src_reg + insn->off >= TASK_SIZE_MAX + PAGE_SIZE check
- PROBE_MEM doesn't support store
- PROBE_MEM32 relies on the verifier to clear upper 32-bit in the register
- PROBE_MEM32 adds 64-bit kern_vm_start address (which is stored in %r12 in the prologue). Due to bpf_arena constructions such %r12 + %reg + off16 access is guaranteed to be within arena virtual range, so no address check at run-time.
- PROBE_MEM32 allows STX and ST. If they fault the store is a nop. When LDX faults the destination register is zeroed.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20240308010812.89848-4-alexei.starovoitov@gmail.com
2024-03-11  bpf: Disasm support for addr_space_cast instruction.  (Alexei Starovoitov)
LLVM generates rX = addr_space_cast(rY, dst_addr_space, src_addr_space) instruction when pointers in non-zero address space are used by the bpf program. Recognize this insn in uapi and in bpf disassembler. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/bpf/20240308010812.89848-3-alexei.starovoitov@gmail.com
2024-03-11  bpf: Introduce bpf_arena.  (Alexei Starovoitov)
Introduce bpf_arena, which is a sparse shared memory region between the bpf program and user space.

Use cases:

1. User space mmap-s bpf_arena and uses it as a traditional mmap-ed anonymous region, like memcached or any key/value storage. The bpf program implements an in-kernel accelerator. XDP prog can search for a key in bpf_arena and return a value without going to user space.

2. The bpf program builds arbitrary data structures in bpf_arena (hash tables, rb-trees, sparse arrays), while user space consumes it.

3. bpf_arena is a "heap" of memory from the bpf program's point of view. The user space may mmap it, but bpf program will not convert pointers to user base at run-time to improve bpf program speed.

Initially, the kernel vm_area and user vma are not populated. User space can fault in pages within the range. While servicing a page fault, bpf_arena logic will insert a new page into the kernel and user vmas. The bpf program can allocate pages from that region via bpf_arena_alloc_pages(). This kernel function will insert pages into the kernel vm_area. The subsequent fault-in from user space will populate that page into the user vma. The BPF_F_SEGV_ON_FAULT flag at arena creation time can be used to prevent fault-in from user space. In such a case, if a page is not allocated by the bpf program and not present in the kernel vm_area, the user process will segfault. This is useful for use cases 2 and 3 above.

bpf_arena_alloc_pages() is similar to user space mmap(). It allocates pages either at a specific address within the arena or allocates a range with the maple tree. bpf_arena_free_pages() is analogous to munmap(), which frees pages and removes the range from the kernel vm_area and from user process vmas.

bpf_arena can be used as a bpf program "heap" of up to 4GB. The speed of bpf program is more important than ease of sharing with user space. This is use case 3. In such a case, the BPF_F_NO_USER_CONV flag is recommended. It will tell the verifier to treat the rX = bpf_arena_cast_user(rY) instruction as a 32-bit move wX = wY, which will improve bpf prog performance. Otherwise, bpf_arena_cast_user is translated by JIT to conditionally add the upper 32 bits of user vm_start (if the pointer is not NULL) to arena pointers before they are stored into memory. This way, user space sees them as valid 64-bit pointers.

Diff https://github.com/llvm/llvm-project/pull/84410 enables the LLVM BPF backend to generate the bpf_addr_space_cast() instruction to cast pointers between address_space(1), which is reserved for bpf_arena pointers, and the default address space zero. All arena pointers in a bpf program written in C language are tagged as __attribute__((address_space(1))). Hence, clang provides helpful diagnostics when pointers cross address space. Libbpf and the kernel support only address_space == 1. All other address space identifiers are reserved.

rX = bpf_addr_space_cast(rY, /* dst_as */ 1, /* src_as */ 0) tells the verifier that rX->type = PTR_TO_ARENA. Any further operations on PTR_TO_ARENA register have to be in the 32-bit domain. The verifier will mark load/store through PTR_TO_ARENA with PROBE_MEM32. JIT will generate them as kern_vm_start + 32bit_addr memory accesses. The behavior is similar to copy_from_kernel_nofault() except that no address checks are necessary. The address is guaranteed to be in the 4GB range. If the page is not present, the destination register is zeroed on read, and the operation is ignored on write.
rX = bpf_addr_space_cast(rY, 0, 1) tells the verifier that rX->type = unknown scalar. If arena->map_flags has BPF_F_NO_USER_CONV set, then the verifier converts such cast instructions to mov32. Otherwise, JIT will emit native code equivalent to: rX = (u32)rY; if (rY) rX |= clear_lo32_bits(arena->user_vm_start); /* replace hi32 bits in rX */ After such conversion, the pointer becomes a valid user pointer within bpf_arena range. The user process can access data structures created in bpf_arena without any additional computations. For example, a linked list built by a bpf program can be walked natively by user space. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Barret Rhoden <brho@google.com> Link: https://lore.kernel.org/bpf/20240308010812.89848-2-alexei.starovoitov@gmail.com
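Putting the pieces together, a minimal arena-backed program might look like this (a sketch following the conventions described above and in the selftests; map sizing and the program name are illustrative):

  struct {
          __uint(type, BPF_MAP_TYPE_ARENA);
          __uint(map_flags, BPF_F_MMAPABLE);
          __uint(max_entries, 10);          /* max pages in the arena */
  } arena SEC(".maps");

  SEC("syscall")
  int arena_smoke_test(void *ctx)
  {
          /* allocate one page anywhere in the arena, on any NUMA node */
          void __arena *page;

          page = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
          if (!page)
                  return 1;
          bpf_arena_free_pages(&arena, page, 1);
          return 0;
  }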
2024-03-11  ravb: Correct buffer size to map for R-Car Rx  (Niklas Söderlund)
When creating a helper to allocate and align an skb, one location where the skb data size was updated was missed. This can lead to a warning being printed when the memory is unmapped, as it now always unmaps the maximum frame size instead of the size after alignment. This was correctly done for RZ/G2L but missed for R-Car.

Fixes: cfbad64706c1 ("ravb: Create helper to allocate skb and align it")
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://lore.kernel.org/r/20240308224237.496924-1-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11  net: amt: Remove generic .ndo_get_stats64  (Breno Leitao)
Commit 3e2f544dd8a33 ("net: get stats64 if device if driver is configured") moved the dev_get_tstats64() callback into the net core, so, unless the driver is doing some custom stats collection, it does not need to set .ndo_get_stats64.

Since this driver now relies on NETDEV_PCPU_STAT_TSTATS, it doesn't need to set the generic dev_get_tstats64() as its .ndo_get_stats64 function pointer.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Taehee Yoo <ap420073@gmail.com>
Link: https://lore.kernel.org/r/20240308162606.1597287-2-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11  net: amt: Move stats allocation to core  (Breno Leitao)
With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and convert veth & vrf"), stats allocation could be done in the net core instead of in this driver. With this new approach, the driver doesn't have to bother with error handling (allocation failure checking, making sure free happens in the right spot, etc). This is core responsibility now.

Move the amt driver to leverage the core allocation.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Taehee Yoo <ap420073@gmail.com>
Link: https://lore.kernel.org/r/20240308162606.1597287-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
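With core allocation, the driver-side opt-in reduces to declaring the stats type during setup, roughly (a sketch; amt_link_setup() as the exact hook is an assumption):

  static void amt_link_setup(struct net_device *dev)
  {
          /* let the core allocate and free per-cpu tstats */
          dev->pcpu_stat_type = NETDEV_PCPU_STAT_TSTATS;
          /* ... remaining setup unchanged; the driver-side
           * alloc/free and its error handling go away */
  }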
2024-03-11  netlink: specs: support generating code for genl socket priv  (Jakub Kicinski)
The family struct is auto-generated for new families, so support use of the sock_priv_* mechanism added in commit a731132424ad ("genetlink: introduce per-sock family private storage").

For example if the family wants to use struct sk_buff as its private struct (unrealistic but just for illustration), it would add to its spec:

  kernel-family:
    headers: [ "linux/skbuff.h" ]
    sock-priv: struct sk_buff

ynl-gen-c will declare the appropriate priv size and hook in function prototypes to be implemented by the family.

Link: https://lore.kernel.org/r/20240308190319.2523704-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>