git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"As usual, many cleanups. The below blurbiage describes 42 patchsets.
21 of those are partially or fully cleanup work. "cleans up",
"cleanup", "maintainability", "rationalizes", etc.
I never knew the MM code was so dirty.
"mm: ksm: prevent KSM from breaking merging of new VMAs" (Lorenzo Stoakes)
addresses an issue with KSM's PR_SET_MEMORY_MERGE mode: newly
mapped VMAs were not eligible for merging with existing adjacent
VMAs.
"mm/damon: introduce DAMON_STAT for simple and practical access monitoring" (SeongJae Park)
adds a new kernel module which simplifies the setup and usage of
DAMON in production environments.
"stop passing a writeback_control to swap/shmem writeout" (Christoph Hellwig)
is a cleanup to the writeback code which removes a couple of
pointers from struct writeback_control.
"drivers/base/node.c: optimization and cleanups" (Donet Tom)
contains largely uncorrelated cleanups to the NUMA node setup and
management code.
"mm: userfaultfd: assorted fixes and cleanups" (Tal Zussman)
does some maintenance work on the userfaultfd code.
"Readahead tweaks for larger folios" (Ryan Roberts)
implements some tuneups for pagecache readahead when it is reading
into order>0 folios.
"selftests/mm: Tweaks to the cow test" (Mark Brown)
provides some cleanups and consistency improvements to the
selftests code.
"Optimize mremap() for large folios" (Dev Jain)
does that. A 37% reduction in execution time was measured in a
memset+mremap+munmap microbenchmark.
"Remove zero_user()" (Matthew Wilcox)
expunges zero_user() in favor of the more modern memzero_page().
"mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes" (David Hildenbrand)
addresses some warts which David noticed in the huge page code.
These were not known to be causing any issues at this time.
"mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD" (SeongJae Park)
provides some cleanup and consolidation work in DAMON.
"use vm_flags_t consistently" (Lorenzo Stoakes)
uses vm_flags_t in places where we were inappropriately using other
types.
"mm/memfd: Reserve hugetlb folios before allocation" (Vivek Kasireddy)
increases the reliability of large page allocation in the memfd
code.
"mm: Remove pXX_devmap page table bit and pfn_t type" (Alistair Popple)
removes several now-unneeded PFN_* flags.
"mm/damon: decouple sysfs from core" (SeongJae Park)
implements some cleanup and maintainability work in the DAMON
sysfs layer.
"madvise cleanup" (Lorenzo Stoakes)
does quite a lot of cleanup/maintenance work in the madvise() code.
"madvise anon_name cleanups" (Vlastimil Babka)
provides additional cleanups on top of Lorenzo's effort.
"Implement numa node notifier" (Oscar Salvador)
creates a standalone notifier for NUMA node memory state changes.
Previously these were lumped under the more general memory
on/offline notifier.
"Make MIGRATE_ISOLATE a standalone bit" (Zi Yan)
cleans up the pageblock isolation code and fixes a potential issue
which doesn't seem to cause any problems in practice.
"selftests/damon: add python and drgn based DAMON sysfs functionality tests" (SeongJae Park)
adds additional drgn- and python-based DAMON selftests which are
more comprehensive than the existing selftest suite.
"Misc rework on hugetlb faulting path" (Oscar Salvador)
fixes a rather obscure deadlock in the hugetlb fault code and
follows that fix with a series of cleanups.
"cma: factor out allocation logic from __cma_declare_contiguous_nid" (Mike Rapoport)
rationalizes and cleans up the highmem-specific code in the CMA
allocator.
"mm/migration: rework movable_ops page migration (part 1)" (David Hildenbrand)
provides cleanups and future-preparedness to the migration code.
"mm/damon: add trace events for auto-tuned monitoring intervals and DAMOS quota" (SeongJae Park)
adds some tracepoints to some DAMON auto-tuning code.
"mm/damon: fix misc bugs in DAMON modules" (SeongJae Park)
does that.
"mm/damon: misc cleanups" (SeongJae Park)
also does what it claims.
"mm: folio_pte_batch() improvements" (David Hildenbrand)
cleans up the large folio PTE batching code.
"mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions" (SeongJae Park)
facilitates dynamic alteration of DAMON's inter-node allocation
policy.
"Remove unmap_and_put_page()" (Vishal Moola)
provides a couple of page->folio conversions.
"mm: per-node proactive reclaim" (Davidlohr Bueso)
implements a per-node control of proactive reclaim - beyond the
current memcg-based implementation.
"mm/damon: remove damon_callback" (SeongJae Park)
replaces the damon_callback interface with a more general and
powerful damon_call()+damos_walk() interface.
"mm/mremap: permit mremap() move of multiple VMAs" (Lorenzo Stoakes)
implements a number of mremap cleanups (of course) in preparation
for adding new mremap() functionality: newly permit the remapping
of multiple VMAs when the user is specifying MREMAP_FIXED. It still
excludes some specialized situations where this cannot be performed
reliably.
"drop hugetlb_free_pgd_range()" (Anthony Yznaga)
switches some sparc hugetlb code over to the generic version and
removes the thus-unneeded hugetlb_free_pgd_range().
"mm/damon/sysfs: support periodic and automated stats update" (SeongJae Park)
augments the present userspace-requested update of DAMON sysfs
monitoring files. Automatic update is now provided, along with a
tunable to control the update interval.
"Some randome fixes and cleanups to swapfile" (Kemeng Shi)
does what it claims.
"mm: introduce snapshot_page" (Luiz Capitulino and David Hildenbrand)
provides (and uses) a means by which debug-style functions can grab
a copy of a pageframe and inspect it locklessly without tripping
over the races inherent in operating on the live pageframe
directly.
"use per-vma locks for /proc/pid/maps reads" (Suren Baghdasaryan)
addresses the large contention issues which can be triggered by
reads from that procfs file. Latencies are reduced by more than
half in some situations. The series also introduces several new
selftests for the /proc/pid/maps interface.
"__folio_split() clean up" (Zi Yan)
cleans up __folio_split()!
"Optimize mprotect() for large folios" (Dev Jain)
provides some quite large (>3x) speedups to mprotect() when dealing
with large folios.
"selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" and some cleanup" (wang lian)
does some cleanup work in the selftests code.
"tools/testing: expand mremap testing" (Lorenzo Stoakes)
extends the mremap() selftest in several ways, including adding
more checking of Lorenzo's recently added "permit mremap() move of
multiple VMAs" feature.
"selftests/damon/sysfs.py: test all parameters" (SeongJae Park)
extends the DAMON sysfs interface selftest so that it tests all
possible user-requested parameters, rather than the present minimal
subset"
* tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (370 commits)
MAINTAINERS: add missing headers to mempory policy & migration section
MAINTAINERS: add missing file to cgroup section
MAINTAINERS: add MM MISC section, add missing files to MISC and CORE
MAINTAINERS: add missing zsmalloc file
MAINTAINERS: add missing files to page alloc section
MAINTAINERS: add missing shrinker files
MAINTAINERS: move memremap.[ch] to hotplug section
MAINTAINERS: add missing mm_slot.h file THP section
MAINTAINERS: add missing interval_tree.c to memory mapping section
MAINTAINERS: add missing percpu-internal.h file to per-cpu section
mm/page_alloc: remove trace_mm_alloc_contig_migrate_range_info()
selftests/damon: introduce _common.sh to host shared function
selftests/damon/sysfs.py: test runtime reduction of DAMON parameters
selftests/damon/sysfs.py: test non-default parameters runtime commit
selftests/damon/sysfs.py: generalize DAMON context commit assertion
selftests/damon/sysfs.py: generalize monitoring attributes commit assertion
selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion
selftests/damon/sysfs.py: test DAMOS filters commitment
selftests/damon/sysfs.py: generalize DAMOS scheme commit assertion
selftests/damon/sysfs.py: test DAMOS destinations commitment
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
"A quick summary: perf support for Branch Record Buffer Extensions
(BRBE), typical PMU hardware updates, small additions to MTE for
store-only tag checking and exposing non-address bits to signal
handlers, HAVE_LIVEPATCH enabled on arm64, VMAP_STACK forced on.
There is also a TLBI optimisation on hardware that does not require
break-before-make when changing the user PTEs between contiguous and
non-contiguous.
More details:
Perf and PMU updates:
- Add support for new (v3) Hisilicon SLLC and DDRC PMUs
- Add support for Arm-NI PMU integrations that share interrupts
between clock domains within a given instance
- Allow SPE to be configured with a lower sample period than the
minimum recommendation advertised by PMSIDR_EL1.Interval
- Add support for Arm's "Branch Record Buffer Extension" (BRBE)
- Adjust the perf watchdog period according to cpu frequency changes
- Minor driver fixes and cleanups
Hardware features:
- Support for MTE store-only checking (FEAT_MTE_STORE_ONLY)
- Support for reporting the non-address bits during a synchronous MTE
tag check fault (FEAT_MTE_TAGGED_FAR)
- Optimise the TLBI when folding/unfolding contiguous PTEs on
hardware with FEAT_BBM (break-before-make) level 2 and no TLB
conflict aborts
Software features:
- Enable HAVE_LIVEPATCH after implementing arch_stack_walk_reliable()
and using the text-poke API for late module relocations
- Force VMAP_STACK always on and change arm64_efi_rt_init() to use
arch_alloc_vmap_stack() in order to avoid KASAN false positives
ACPI:
- Improve SPCR handling and messaging on systems lacking an SPCR
table
Debug:
- Simplify the debug exception entry path
- Drop redundant DBG_MDSCR_* macros
Kselftests:
- Cleanups and improvements for SME, SVE and FPSIMD tests
Miscellaneous:
- Optimise loop to reduce redundant operations in contpte_ptep_get()
- Remove ISB when resetting POR_EL0 during signal handling
- Mark the kernel as tainted on SEA and SError panic
- Remove redundant gcs_free() call"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (93 commits)
arm64/gcs: task_gcs_el0_enable() should use passed task
arm64: Kconfig: Keep selects somewhat alphabetically ordered
arm64: signal: Remove ISB when resetting POR_EL0
kselftest/arm64: Handle attempts to disable SM on SME only systems
kselftest/arm64: Fix SVE write data generation for SME only systems
kselftest/arm64: Test SME on SME only systems in fp-ptrace
kselftest/arm64: Test FPSIMD format data writes via NT_ARM_SVE in fp-ptrace
kselftest/arm64: Allow sve-ptrace to run on SME only systems
arm64/mm: Drop redundant addr increment in set_huge_pte_at()
kselftest/arm4: Provide local defines for AT_HWCAP3
arm64: Mark kernel as tainted on SAE and SError panic
arm64/gcs: Don't call gcs_free() when releasing task_struct
drivers/perf: hisi: Support PMUs with no interrupt
drivers/perf: hisi: Relax the event number check of v2 PMUs
drivers/perf: hisi: Add support for HiSilicon SLLC v3 PMU driver
drivers/perf: hisi: Use ACPI driver_data to retrieve SLLC PMU information
drivers/perf: hisi: Add support for HiSilicon DDRC v3 PMU driver
drivers/perf: hisi: Simplify the probe process for each DDRC version
perf/arm-ni: Support sharing IRQs within an NI instance
perf/arm-ni: Consolidate CPU affinity handling
...
|
|
Override the generic definition of modify_prot_start_ptes() to use
get_and_clear_full_ptes(). This helper does a TLBI only for the starting
and ending contpte block of the range, whereas the current implementation
will call ptep_get_and_clear() for every contpte block, thus doing a TLBI
on every contpte block. Therefore, we have a performance win.
The arm64 definition of pte_accessible() allows us to batch in the
errata specific case:
#define pte_accessible(mm, pte) \
(mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte))
All ptes are obviously present in the folio batch, and they are also valid.
Override the generic definition of modify_prot_commit_ptes() to simply use
set_ptes() to map the new ptes into the page table.
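As a rough sketch of what such overrides look like (hedged: the exact
signatures and the nr/full handling here are assumptions, not the literal
arm64 code):

#define modify_prot_start_ptes modify_prot_start_ptes
static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
					   unsigned long addr, pte_t *ptep,
					   unsigned int nr)
{
	/* One TLBI for the first and last contpte block only. */
	return get_and_clear_full_ptes(vma->vm_mm, addr, ptep, nr, /* full */ 0);
}

#define modify_prot_commit_ptes modify_prot_commit_ptes
static inline void modify_prot_commit_ptes(struct vm_area_struct *vma,
					   unsigned long addr, pte_t *ptep,
					   pte_t old_pte, pte_t pte,
					   unsigned int nr)
{
	/* Simply map the new ptes back into the page table. */
	set_ptes(vma->vm_mm, addr, ptep, pte, nr);
}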
Link: https://lkml.kernel.org/r/20250718090244.21092-8-dev.jain@arm.com
Signed-off-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Cc: Zhenhua Huang <quic_zhenhuah@quicinc.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
'for-next/misc', 'for-next/acpi', 'for-next/debug-entry', 'for-next/feat_mte_tagged_far', 'for-next/kselftest', 'for-next/mdscr-cleanup' and 'for-next/vmap-stack', remote-tracking branch 'arm64/for-next/perf' into for-next/core
* arm64/for-next/perf: (23 commits)
drivers/perf: hisi: Support PMUs with no interrupt
drivers/perf: hisi: Relax the event number check of v2 PMUs
drivers/perf: hisi: Add support for HiSilicon SLLC v3 PMU driver
drivers/perf: hisi: Use ACPI driver_data to retrieve SLLC PMU information
drivers/perf: hisi: Add support for HiSilicon DDRC v3 PMU driver
drivers/perf: hisi: Simplify the probe process for each DDRC version
perf/arm-ni: Support sharing IRQs within an NI instance
perf/arm-ni: Consolidate CPU affinity handling
perf/cxlpmu: Fix typos in cxl_pmu.c comments and documentation
perf/cxlpmu: Remove unintended newline from IRQ name format string
perf/cxlpmu: Fix devm_kcalloc() argument order in cxl_pmu_probe()
perf: arm_spe: Relax period restriction
perf: arm_pmuv3: Add support for the Branch Record Buffer Extension (BRBE)
KVM: arm64: nvhe: Disable branch generation in nVHE guests
arm64: Handle BRBE booting requirements
arm64/sysreg: Add BRBE registers and fields
perf/arm: Add missing .suppress_bind_attrs
perf/arm-cmn: Reduce stack usage during discovery
perf: imx9_perf: make the read-only array mask static const
perf/arm-cmn: Broaden module description for wider interconnect support
...
* for-next/livepatch:
: Support for HAVE_LIVEPATCH on arm64
arm64: Kconfig: Keep selects somewhat alphabetically ordered
arm64: Implement HAVE_LIVEPATCH
arm64: stacktrace: Implement arch_stack_walk_reliable()
arm64: stacktrace: Check kretprobe_find_ret_addr() return value
arm64/module: Use text-poke API for late relocations.
* for-next/user-contig-bbml2:
: Optimise the TLBI when folding/unfolding contiguous PTEs on hardware with BBML2 and no TLB conflict aborts
arm64/mm: Elide tlbi in contpte_convert() under BBML2
iommu/arm: Add BBM Level 2 smmu feature
arm64: Add BBM Level 2 cpu feature
arm64: cpufeature: Introduce MATCH_ALL_EARLY_CPUS capability type
* for-next/misc:
: Miscellaneous arm64 patches
arm64/gcs: task_gcs_el0_enable() should use passed task
arm64: signal: Remove ISB when resetting POR_EL0
arm64/mm: Drop redundant addr increment in set_huge_pte_at()
arm64: Mark kernel as tainted on SAE and SError panic
arm64/gcs: Don't call gcs_free() when releasing task_struct
arm64: fix unnecessary rebuilding when CONFIG_DEBUG_EFI=y
arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get
arm64: pi: use 'targets' instead of extra-y in Makefile
* for-next/acpi:
: Various ACPI arm64 changes
ACPI: Suppress misleading SPCR console message when SPCR table is absent
ACPI: Return -ENODEV from acpi_parse_spcr() when SPCR support is disabled
* for-next/debug-entry:
: Simplify the debug exception entry path
arm64: debug: remove debug exception registration infrastructure
arm64: debug: split bkpt32 exception entry
arm64: debug: split brk64 exception entry
arm64: debug: split hardware watchpoint exception entry
arm64: debug: split single stepping exception entry
arm64: debug: refactor reinstall_suspended_bps()
arm64: debug: split hardware breakpoint exception entry
arm64: entry: Add entry and exit functions for debug exceptions
arm64: debug: remove break/step handler registration infrastructure
arm64: debug: call step handlers statically
arm64: debug: call software breakpoint handlers statically
arm64: refactor aarch32_break_handler()
arm64: debug: clean up single_step_handler logic
* for-next/feat_mte_tagged_far:
: Support for reporting the non-address bits during a synchronous MTE tag check fault
kselftest/arm64/mte: Add mtefar tests on check_mmap_options
kselftest/arm64/mte: Refactor check_mmap_option test
kselftest/arm64/mte: Add verification for address tag in signal handler
kselftest/arm64/mte: Add address tag related macro and function
kselftest/arm64/mte: Check MTE_FAR feature is supported
kselftest/arm64/mte: Register mte signal handler with SA_EXPOSE_TAGBITS
kselftest/arm64: Add MTE_FAR hwcap test
KVM: arm64: Expose FEAT_MTE_TAGGED_FAR feature to guest
arm64: Report address tag when FEAT_MTE_TAGGED_FAR is supported
arm64/cpufeature: Add FEAT_MTE_TAGGED_FAR feature
* for-next/kselftest:
: Kselftest updates for arm64
kselftest/arm64: Handle attempts to disable SM on SME only systems
kselftest/arm64: Fix SVE write data generation for SME only systems
kselftest/arm64: Test SME on SME only systems in fp-ptrace
kselftest/arm64: Test FPSIMD format data writes via NT_ARM_SVE in fp-ptrace
kselftest/arm64: Allow sve-ptrace to run on SME only systems
kselftest/arm4: Provide local defines for AT_HWCAP3
kselftest/arm64: Specify SVE data when testing VL set in sve-ptrace
kselftest/arm64: Fix test for streaming FPSIMD write in sve-ptrace
kselftest/arm64: Fix check for setting new VLs in sve-ptrace
kselftest/arm64: Convert tpidr2 test to use kselftest.h
* for-next/mdscr-cleanup:
: Drop redundant DBG_MDSCR_* macros
KVM: selftests: Change MDSCR_EL1 register holding variables as uint64_t
arm64/debug: Drop redundant DBG_MDSCR_* macros
* for-next/vmap-stack:
: Force VMAP_STACK on arm64
arm64: remove CONFIG_VMAP_STACK checks from entry code
arm64: remove CONFIG_VMAP_STACK checks from SDEI stack handling
arm64: remove CONFIG_VMAP_STACK checks from stacktrace overflow logic
arm64: remove CONFIG_VMAP_STACK conditionals from traps overflow stack
arm64: remove CONFIG_VMAP_STACK conditionals from irq stack setup
arm64: Remove CONFIG_VMAP_STACK conditionals from THREAD_SHIFT and THREAD_ALIGN
arm64: efi: Remove CONFIG_VMAP_STACK check
arm64: Mandate VMAP_STACK
arm64: efi: Fix KASAN false positive for EFI runtime stack
arm64/ptrace: Fix stack-out-of-bounds read in regs_get_kernel_stack_nth()
arm64/gcs: Don't call gcs_free() during flush_gcs()
arm64: Restrict pagetable teardown to avoid false warning
docs: arm64: Fix ICC_SRE_EL2 register typo in booting.rst
|
|
The 'addr' variable need not be incremented in the loop because it is not
used subsequently.
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20250716035432.293682-1-anshuman.khandual@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Set TAINT_MACHINE_CHECK when SError or Synchronous External Abort (SEA)
interrupts trigger a panic to flag potential hardware faults. This
tainting mechanism aids in debugging and enables correlation of
hardware-related crashes in large-scale deployments.
This change aligns with similar patches[1] that mark machine check
events when the system crashes due to hardware errors.
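As a minimal illustration of the mechanism (the panic site and message
below are examples, not the exact diff):

	/* Flag the crash as a probable hardware fault before panicking. */
	add_taint(TAINT_MACHINE_CHECK, LOCKDEP_STILL_OK);
	panic("Asynchronous SError Interrupt");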
Link: https://lore.kernel.org/all/20250702-add_tain-v1-1-9187b10914b9@debian.org/ [1]
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20250716-vmcore_hw_error-v2-1-f187f7d62aba@debian.org
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Currently we call gcs_free() when releasing task_struct, but this is
redundant: it attempts to deallocate any kernel-managed userspace GCS,
which should no longer be relevant, and it resets values in the very
struct we are in the process of freeing.
By the time arch_release_task_struct() is called, the mm will have been
disassociated from the task, so the check for a mm in gcs_free() will
always be false. For threads that exit while leaving the mm active,
deactivate_mm() will have been called previously and will have freed any
kernel-managed GCS.
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20250714-arm64-gcs-release-task-v2-1-8a83cadfc846@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Memory hot remove unmaps and tears down various kernel page table regions
as required. The ptdump code can race with concurrent modifications of
the kernel page tables. When leaf entries are modified concurrently, the
dump code may log stale or inconsistent information for a VA range, but
this is otherwise not harmful.
But when intermediate levels of kernel page table are freed, the dump code
will continue to use memory that has been freed and potentially
reallocated for another purpose. In such cases, the ptdump code may
dereference bogus addresses, leading to a number of potential problems.
To avoid the above mentioned race condition, platforms such as arm64,
riscv and s390 take memory hotplug lock, while dumping kernel page table
via the sysfs interface /sys/kernel/debug/kernel_page_tables.
Similar race condition exists while checking for pages that might have
been marked W+X via /sys/kernel/debug/kernel_page_tables/check_wx_pages
which in turn calls ptdump_check_wx(). Instead of solving this race
condition again, let's just move the memory hotplug lock inside generic
ptdump_check_wx(), which will benefit both scenarios.
Drop get_online_mems() and put_online_mems() combination from all existing
platform ptdump code paths.
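A rough sketch of the resulting shape (hedged; arch_ptdump_check_wx() is a
hypothetical stand-in for the per-arch walk, not a real function):

bool ptdump_check_wx(void)
{
	bool ret;

	/* Serialise against memory hot remove so that intermediate page
	 * table levels cannot be freed under the walker. */
	get_online_mems();
	ret = arch_ptdump_check_wx();	/* hypothetical per-arch hook */
	put_online_mems();

	return ret;
}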
Link: https://lkml.kernel.org/r/20250620052427.2092093-1-anshuman.khandual@arm.com
Fixes: bbd6ec605c0f ("arm64/mm: Enable memory hot remove")
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> [s390]
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In future we intend to change the vm_flags_t type, so it isn't correct for
architecture and driver code to assume it is unsigned long. Correct this
assumption across the board.
Overall, this patch does not introduce any functional change.
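As a rough illustration of the kind of change applied throughout (the
function below is hypothetical, not taken from the series):

-int set_region_flags(struct vm_area_struct *vma, unsigned long vm_flags);
+int set_region_flags(struct vm_area_struct *vma, vm_flags_t vm_flags);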
Link: https://lkml.kernel.org/r/b6eb1894abc5555ece80bb08af5c022ef780c8bc.1750274467.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "use vm_flags_t consistently".
The VMA flags field vma->vm_flags is of type vm_flags_t. Right now this
is exactly equivalent to unsigned long, but it should not be assumed to
be.
Much code that references vma->vm_flags already correctly uses vm_flags_t,
but a fairly large chunk of code simply uses unsigned long and assumes
that the two are equivalent.
This series corrects that and has us use vm_flags_t consistently.
This series is motivated by the desire to, in a future series, adjust
vm_flags_t to be a u64 regardless of whether the kernel is 32-bit or
64-bit in order to deal with the VMA flag exhaustion issue and avoid all
the various problems that arise from it (being unable to use certain
features in 32-bit, being unable to add new flags except for 64-bit, etc.)
This is therefore a critical first step towards that goal. At any rate,
using the correct type is of value regardless.
We additionally take the opportunity to refer to VMA flags as vm_flags
where possible to make clear what we're referring to.
Overall, this series does not introduce any functional change.
This patch (of 3):
We abstract the type of the VMA flags to vm_flags_t; however, in many
places it is simply assumed that this is unsigned long, which is incorrect.
At the moment this is simply an incongruity, however in future we plan to
change this type and therefore this change is a critical requirement for
doing so.
Overall, this patch does not introduce any functional change.
[lorenzo.stoakes@oracle.com: add missing vm_get_page_prot() instance, remove include]
Link: https://lkml.kernel.org/r/552f88e1-2df8-4e95-92b8-812f7c8db829@lucifer.local
Link: https://lkml.kernel.org/r/cover.1750274467.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/a12769720a2743f235643b158c4f4f0a9911daf0.1750274467.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Now that debug exceptions are handled individually and without the need
for dynamic registration, remove the unused registration infrastructure.
This removes the external caller for `debug_exception_enter()` and
`debug_exception_exit()`.
Make them static again and remove them from the header.
Remove `early_brk64()` as it has been made redundant by
(arm64: debug: split brk64 exception entry) and is not used anymore.
Note: in `early_brk64()`, `bug_brk_handler()` is called unconditionally
as a fall-through, but now `call_break_hook()` only calls it if the
immediate matches.
This does not change the behaviour in early boot, as if
`bug_brk_handler()` was called on a non-BUG immediate it would return
DBG_HOOK_ERROR anyway, which `call_break_hook()` will do if no immediate
matches.
Remove `trap_init()`, as it would be empty and a weak definition already
exists in `init/main.c`.
Signed-off-by: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Tested-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Reviewed-by: Will Deacon <will@kernel.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20250707114109.35672-14-ada.coupriediaz@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Move the `debug_exception_enter()` and `debug_exception_exit()`
functions from mm/fault.c, as they are needed to split
the debug exceptions entry paths from the current unified one.
Make them externally visible in include/asm/exception.h until
the caller in mm/fault.c is cleaned up.
Signed-off-by: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Tested-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Will Deacon <will@kernel.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20250707114109.35672-7-ada.coupriediaz@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Register X0 contains PIE_E1_ASM and should not be written into REG_TCR2_EL1
which could have an adverse impact otherwise. This has remained undetected
till now probably because current value for PIE_E1_ASM (0xcc880e0ac0800000)
clears TCR2_EL1 which again gets set subsequently with 'tcr2' after testing
for FEAT_TCR2.
Drop this unwarranted 'msr' which is a stray change from an earlier commit.
This line got re-introduced when rebasing on top of the commit 926b66e2ebc8
("arm64: setup: name 'tcr2' register").
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Fixes: 7052e808c446 ("arm64/sysreg: Get rid of the TCR2_EL1x SysregFields")
Acked-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20250704063812.298914-1-anshuman.khandual@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
We do not currently issue an ISB after updating POR_EL0 when
context-switching it, for instance. The rationale is that if the old
value of POR_EL0 is more restrictive and causes a fault during
uaccess, the access will be retried [1]. In other words, we are
trading an ISB on every context-switching for the (unlikely)
possibility of a spurious fault. We may also miss faults if the new
value of POR_EL0 is more restrictive, but that's considered
acceptable.
However, as things stand, a spurious Overlay fault results in
uaccess failing right away since it causes fault_from_pkey() to
return true. If an Overlay fault is reported, we therefore need to
double check POR_EL0 against vma_pkey(vma) - this is what
arch_vma_access_permitted() already does.
As it turns out, we already perform that explicit check if no
Overlay fault is reported, and we need to keep that check (see
comment added in fault_from_pkey()). Net result: the Overlay ISS2
bit isn't of much help to decide whether a pkey fault occurred.
Remove the check for the Overlay bit from fault_from_pkey() and
add a comment to try and explain the situation. While at it, also
add a comment to permission_overlay_switch() in case anyone gets
surprised by the lack of ISB.
[1] https://lore.kernel.org/linux-arm-kernel/ZtYNGBrcE-j35fpw@arm.com/
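A simplified sketch of the resulting check (hedged; the exact parameters
and structure of fault_from_pkey() may differ from this):

static bool fault_from_pkey(struct vm_area_struct *vma, unsigned int mm_flags)
{
	if (!system_supports_poe())
		return false;
	/*
	 * POR_EL0 may be stale (no ISB when it is switched), so do not
	 * trust the reported Overlay bit; re-check the VMA's pkey
	 * permissions explicitly instead.
	 */
	return !arch_vma_access_permitted(vma, mm_flags & FAULT_FLAG_WRITE,
					  mm_flags & FAULT_FLAG_INSTRUCTION,
					  false);
}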
Fixes: 160a8e13de6c ("arm64: context switch POR_EL0 register")
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Link: https://lore.kernel.org/r/20250619160042.2499290-2-kevin.brodsky@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
This commit optimizes the contpte_ptep_get() and contpte_ptep_get_lockless()
functions by adding early-termination logic: once the dirty and young bits
of orig_pte have both been accumulated, the remaining redundant bit-setting
operations in the loop are skipped. This reduces unnecessary iterations and
improves performance.
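A minimal sketch of the early-termination idea (illustrative only, not the
exact contpte_ptep_get() code):

for (i = 0; i < CONT_PTES; i++, ptep++) {
	pte_t pte = __ptep_get(ptep);

	if (pte_dirty(pte))
		orig_pte = pte_mkdirty(orig_pte);
	if (pte_young(pte))
		orig_pte = pte_mkyoung(orig_pte);

	/* Both bits accumulated: the rest of the block adds nothing. */
	if (pte_dirty(orig_pte) && pte_young(orig_pte))
		break;
}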
To verify the optimization, a test function was designed. Its execution
time and instruction counts were traced using perf; the following results
were obtained on a Qualcomm mobile phone chip:
Test Code:

#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>

#define PAGE_SIZE	4096
#define CONT_PTES	16
#define TEST_SIZE	(4096 * CONT_PTES * PAGE_SIZE)
#define YOUNG_BIT	8

/* Touch every page so all PTEs become young and dirty. */
void rwdata(char *buf)
{
	for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
		buf[i] = 'a';
		volatile char c = buf[i];
	}
}

/* Clear the young and dirty bits via MADV_FREE + MADV_COLD. */
void clear_young_dirty(char *buf)
{
	if (madvise(buf, TEST_SIZE, MADV_FREE) == -1) {
		perror("madvise free failed");
		free(buf);
		exit(EXIT_FAILURE);
	}
	if (madvise(buf, TEST_SIZE, MADV_COLD) == -1) {
		perror("madvise cold failed");
		free(buf);
		exit(EXIT_FAILURE);
	}
}

/* Make only the 8th PTE of each contig block young again. */
void set_one_young(char *buf)
{
	for (size_t i = 0; i < TEST_SIZE; i += CONT_PTES * PAGE_SIZE) {
		volatile char c = buf[i + YOUNG_BIT * PAGE_SIZE];
	}
}

/* TEST_CASE2/TEST_CASE3 select the scenario; see the definitions below. */
void test_contpte_perf(void)
{
	char *buf;
	int ret = posix_memalign((void **)&buf, CONT_PTES * PAGE_SIZE,
				 TEST_SIZE);

	/* Check alignment to a full contpte block. */
	if ((ret != 0) || ((unsigned long)buf % (CONT_PTES * PAGE_SIZE))) {
		perror("posix_memalign failed");
		exit(EXIT_FAILURE);
	}

	rwdata(buf);
#if TEST_CASE2 || TEST_CASE3
	clear_young_dirty(buf);
#endif
#if TEST_CASE2
	set_one_young(buf);
#endif

	for (int j = 0; j < 500; j++) {
		mlock(buf, TEST_SIZE);
		munlock(buf, TEST_SIZE);
	}
	free(buf);
}

int main(void)
{
	test_contpte_perf();
	return 0;
}
Descriptions of three test scenarios
Scenario 1
The data of all 16 PTEs are both dirty and young.
#define TEST_CASE2 0
#define TEST_CASE3 0
Scenario 2
Among the 16 PTEs, only the 8th one is young, and there are no dirty ones.
#define TEST_CASE2 1
#define TEST_CASE3 0
Scenario 3
Among the 16 PTEs, there are neither young nor dirty ones.
#define TEST_CASE2 0
#define TEST_CASE3 1
Test results
| Scenario 1                     |     Original |    Optimized |
|--------------------------------|--------------|--------------|
| instructions                   |  37912436160 |  18731580031 |
| test time                      |       4.2797 |       2.2949 |
| overhead of contpte_ptep_get() |       21.31% |        4.80% |

| Scenario 2                     |     Original |    Optimized |
|--------------------------------|--------------|--------------|
| instructions                   |  36701270862 |  36115790086 |
| test time                      |       3.2335 |       3.0874 |
| overhead of contpte_ptep_get() |       32.26% |       33.57% |

| Scenario 3                     |     Original |    Optimized |
|--------------------------------|--------------|--------------|
| instructions                   |  36706279735 |  36750881878 |
| test time                      |       3.2008 |       3.1249 |
| overhead of contpte_ptep_get() |       31.94% |       34.59% |
For Scenario 1, optimized code can achieve an instruction benefit of 50.59%
and a time benefit of 46.38%.
For Scenario 2, optimized code can achieve an instruction count benefit of
1.6% and a time benefit of 4.5%.
For Scenario 3, since none of the PTEs have the young or dirty flag, the
optimized code should take the same branches as the original code; indeed,
the test results of the optimized code are close to those of the original.
Ryan re-ran these tests on Apple M2 with 4K base pages + 64K mTHP.
Scenario 1: reduced to 56% of baseline execution time
Scenario 2: reduced to 89% of baseline execution time
Scenario 3: reduced to 91% of baseline execution time
The test results show that the optimization of contpte_ptep_get() is
effective. Since the logic of contpte_ptep_get_lockless() is similar to
that of contpte_ptep_get(), the same optimization scheme is also adopted
for it.
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Signed-off-by: Xavier Xia <xavier.qyxia@gmail.com>
Link: https://lore.kernel.org/r/20250624152549.2647828-1-xavier.qyxia@gmail.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
MDSCR_EL1 has already been defined in tools sysreg format and hence can be
used in all debug-monitor-related call paths. But using the generated sysreg
definitions causes build warnings because there is a mismatch between the
mdscr variable (u32) and the GENMASK()-based masks (unsigned long). Convert
all variables handling the MDSCR_EL1 register to u64, which also reflects
its true width.
--------------------------------------------------------------------------
arch/arm64/kernel/debug-monitors.c: In function ‘disable_debug_monitors’:
arch/arm64/kernel/debug-monitors.c:108:13: warning: conversion from ‘long
unsigned int’ to ‘u32’ {aka ‘unsigned int’} changes value from
‘18446744073709518847’ to ‘4294934527’ [-Woverflow]
108 | disable = ~MDSCR_EL1_MDE;
| ^
--------------------------------------------------------------------------
While here, replace an open encoding with MDSCR_EL1_TDCC in __cpu_setup().
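As a rough before/after illustration of the conversion (hedged; not the
exact diff):

-	u32 mdscr, disable;
+	u64 mdscr, disable;

	mdscr = mdscr_read();
	disable = ~MDSCR_EL1_MDE;	/* GENMASK()-based, unsigned long */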
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20250613023646.1215700-2-anshuman.khandual@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
If FEAT_MTE_TAGGED_FAR (Armv8.9) is supported, bits 63:60 of the fault address
are preserved in response to synchronous tag check faults (SEGV_MTESERR).
This patch makes the following changes to support this feature:
- Use the original FAR_EL1 value when an MTE tag check fault occurs, if
  ARM64_MTE_FAR is supported, so that not only the logical tag
  (bits 59:56) but also the address tag (bits 63:60) is reported.
- Add an HWCAP for mtefar to let userspace know that bits 63:60 include
  address tag information when FEAT_MTE_TAGGED_FAR is supported.
Applications that require this information should install
a signal handler with the SA_EXPOSE_TAGBITS flag.
While this introduces a minor ABI change,
most applications do not set this flag and therefore will not be affected.
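For illustration, a minimal userspace sketch of opting in to the tag bits
(assuming the libc headers expose SA_EXPOSE_TAGBITS; otherwise it comes
from the kernel UAPI headers):

#include <signal.h>
#include <stddef.h>

static void segv_handler(int sig, siginfo_t *info, void *ucontext)
{
	/* With SA_EXPOSE_TAGBITS set, the tag bits of the faulting
	 * address are preserved in info->si_addr. */
}

static void install_mte_handler(void)
{
	struct sigaction sa = {
		.sa_sigaction	= segv_handler,
		.sa_flags	= SA_SIGINFO | SA_EXPOSE_TAGBITS,
	};

	sigaction(SIGSEGV, &sa, NULL);
}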
Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
Link: https://lore.kernel.org/r/20250618084513.1761345-3-yeoreum.yun@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
When converting a region via contpte_convert() to use mTHP, we have two
different goals. We have to mark each entry as contiguous, and we would
like to smear the dirty and young (access) bits across all entries in
the contiguous block. Currently, we do this by first accumulating the
dirty and young bits in the block, using an atomic
__ptep_get_and_clear() and the relevant pte_{dirty,young}() calls,
performing a tlbi, and finally smearing the correct bits across the
block using __set_ptes().
This approach works fine for BBM level 0, but with support for BBM level
2 we are allowed to reorder the tlbi to after setting the pagetable
entries. We expect the time cost of a tlbi to be much greater than the
cost of clearing and resetting the PTEs. As such, this reordering of the
tlbi outside the window where our PTEs are invalid greatly reduces the
duration for which the PTEs are visibly invalid to other threads. This
reduces the likelihood of a concurrent page walk finding an invalid PTE,
reducing the likelihood of a fault in other threads and improving
performance (more so when there are more threads).
Because we support, via an allowlist, only BBML2 implementations that never
raise conflict aborts and instead invalidate the TLB entries automatically
in hardware, we can avoid the final flush altogether.
However, avoiding the intermediate tlbi+dsb must be carefully considered
to ensure that we remain both correct and performant. We document our
reasoning and the expected interactions further in the contpte_convert()
source. To do so we rely on the aarch64 spec (DDI 0487L.a D8.7.1.1)
requirements RNGLXZ and RJQQTC to provide guarantees that the elision is
correct.
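A heavily simplified sketch of the reordering described above (helper names
such as system_supports_bbml2_noabort() and the vma/start handling are
assumptions, not the exact contpte_convert() code):

for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
	pte_t p = __ptep_get_and_clear(mm, addr, ptep);

	if (pte_dirty(p))
		pte = pte_mkdirty(pte);
	if (pte_young(p))
		pte = pte_mkyoung(pte);
}

/* With BBM level 2 and no conflict aborts, this flush can be deferred
 * past __set_ptes() or elided entirely. */
if (!system_supports_bbml2_noabort())
	flush_tlb_range(&vma, start, start + CONT_PTE_SIZE);

__set_ptes(mm, start, start_ptep, pte, CONT_PTES);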
Signed-off-by: Mikołaj Lenczewski <miko.lenczewski@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20250625113435.26849-5-miko.lenczewski@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
Commit 9c006972c3fe ("arm64: mmu: drop pXd_present() checks from
pXd_free_pYd_table()") removes the pxd_present() checks because the
caller checks pxd_present(). But, in case of vmap_try_huge_pud(), the
caller only checks pud_present(); pud_free_pmd_page() recurses on each
pmd through pmd_free_pte_page(), wherein the pmd may be none. Thus it is
possible to hit a warning in the latter, since pmd_none => !pmd_table().
Thus, add a pmd_present() check in pud_free_pmd_page().
This problem was found by code inspection.
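A minimal sketch of the shape of the fix (illustrative; not the exact
pud_free_pmd_page() diff):

for (i = 0; i < PTRS_PER_PMD; i++, next += PMD_SIZE) {
	pmd_t pmd = pmdp_get(&pmdp[i]);

	/* The pmd may be none here; only recurse if it is present. */
	if (!pmd_present(pmd))
		continue;

	pmd_free_pte_page(&pmdp[i], next);
}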
Fixes: 9c006972c3fe ("arm64: mmu: drop pXd_present() checks from pXd_free_pYd_table()")
Cc: stable@vger.kernel.org
Reported-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Link: https://lore.kernel.org/r/20250527082633.61073-1-dev.jain@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of
creating a pte which addresses the first page in a folio and reduces
the amount of plumbing which architectures must implement to provide
this.
- "Misc folio patches for 6.16" from Matthew Wilcox is a shower of
largely unrelated folio infrastructure changes which clean things up
and better prepare us for future work.
- "memory,x86,acpi: hotplug memory alignment advisement" from Gregory
Price adds early-init code to prevent x86 from leaving physical
memory unused when physical address regions are not aligned to memory
block size.
- "mm/compaction: allow more aggressive proactive compaction" from
Michal Clapinski provides some tuning of the (sadly, hard-coded (more
sadly, not auto-tuned)) thresholds for our invocation of proactive
compaction. In a simple test case, the reduction of a guest VM's
memory consumption was dramatic.
- "Minor cleanups and improvements to swap freeing code" from Kemeng
Shi provides some code cleanups and a small efficiency improvement to
this part of our swap handling code.
- "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin
adds the ability for a ptracer to modify syscalls arguments. At this
time we can alter only "system call information that are used by
strace system call tampering, namely, syscall number, syscall
arguments, and syscall return value".
This series should have been incorporated into mm.git's "non-MM"
branch, but I goofed.
- "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from
Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl
against /proc/pid/pagemap. This permits CRIU to more efficiently get
at the info about guard regions.
- "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan
implements that fix. No runtime effect is expected because
validate_page_before_insert() happens to fix up this error.
- "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David
Hildenbrand basically brings uprobe text poking into the current
decade. Remove a bunch of hand-rolled implementation in favor of
using more current facilities.
- "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman
Khandual provides enhancements and generalizations to the pte dumping
code. This might be needed when 128-bit Page Table Descriptors are
enabled for ARM.
- "Always call constructor for kernel page tables" from Kevin Brodsky
ensures that the ctor/dtor is always called for kernel pgtables, as
it already is for user pgtables.
This permits the addition of more functionality such as "insert hooks
to protect page tables". This change does result in various
architectures performing unnecessary work, but this is fixed up where
it is anticipated to occur.
- "Rust support for mm_struct, vm_area_struct, and mmap" from Alice
Ryhl adds plumbing to permit Rust access to core MM structures.
- "fix incorrectly disallowed anonymous VMA merges" from Lorenzo
Stoakes takes advantage of some VMA merging opportunities which we've
been missing for 15 years.
- "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from
SeongJae Park optimizes process_madvise()'s TLB flushing.
Instead of flushing each address range in the provided iovec, we
batch the flushing across all the iovec entries. The syscall's cost
was approximately halved with a microbenchmark which was designed to
load this particular operation.
- "Track node vacancy to reduce worst case allocation counts" from
Sidhartha Kumar makes the maple tree smarter about its node
preallocation.
stress-ng mmap performance increased by single-digit percentages and
the amount of unnecessarily preallocated memory was dramatically
reduced.
- "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes
a few unnecessary things which Baoquan noted when reading the code.
- ""Enhance sysfs handling for memory hotplug in weighted interleave"
from Rakie Kim "enhances the weighted interleave policy in the memory
management subsystem by improving sysfs handling, fixing memory
leaks, and introducing dynamic sysfs updates for memory hotplug
support". Fixes things on error paths which we are unlikely to hit.
- "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory"
from SeongJae Park introduces new DAMOS quota goal metrics which
eliminate the manual tuning which is required when utilizing DAMON
for memory tiering.
- "mm/vmalloc.c: code cleanup and improvements" from Baoquan He
provides cleanups and small efficiency improvements which Baoquan
found via code inspection.
- "vmscan: enforce mems_effective during demotion" from Gregory Price
changes reclaim to respect cpuset.mems_effective during demotion when
possible, because presently reclaim explicitly ignores
cpuset.mems_effective when demoting, which may cause the cpuset
settings to be violated.
This is useful for isolating workloads on a multi-tenant system from
certain classes of memory more consistently.
- "Clean up split_huge_pmd_locked() and remove unnecessary folio
pointers" from Gavin Guo provides minor cleanups and efficiency gains
in the huge page splitting and migrating code.
- "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache
for `struct mem_cgroup', yielding improved memory utilization.
- "add max arg to swappiness in memory.reclaim and lru_gen" from
Zhongkun He adds a new "max" argument to the "swappiness=" argument
for memory.reclaim and MGLRU's lru_gen.
This directs proactive reclaim to reclaim from only anon folios
rather than file-backed folios.
- "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the
first step on the path to permitting the kernel to maintain existing
VMs while replacing the host kernel via file-based kexec. At this
time only memblock's reserve_mem is preserved.
- "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides
and uses a smarter way of looping over a pfn range, by skipping
ranges of invalid pfns.
- "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning
when a task is pinned to a single NUMA node.
Dramatic performance benefits were seen in some real world cases.
- "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank
Garg addresses a warning which occurs during memory compaction when
using JFS.
- "move all VMA allocation, freeing and duplication logic to mm" from
Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more
appropriate mm/vma.c.
- "mm, swap: clean up swap cache mapping helper" from Kairui Song
provides code consolidation and cleanups related to the folio_index()
function.
- "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that.
- "memcg: Fix test_memcg_min/low test failures" from Waiman Long
addresses some bogus failures which are being reported by the
test_memcontrol selftest.
- "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo
Stoakes commences the deprecation of file_operations.mmap() in favor
of the new file_operations.mmap_prepare().
The latter is more restrictive and prevents drivers from messing with
things in ways which, amongst other problems, may defeat VMA merging.
- "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples
the per-cpu memcg charge cache from the objcg's one.
This is a step along the way to making memcg and objcg charging
NMI-safe, which is a BPF requirement.
- "mm/damon: minor fixups and improvements for code, tests, and
documents" from SeongJae Park is yet another batch of miscellaneous
DAMON changes. Fix and improve minor problems in code, tests and
documents.
- "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg
stats to be irq safe. Another step along the way to making memcg
charging and stats updates NMI-safe, a BPF requirement.
- "Let unmap_hugepage_range() and several related functions take folio
instead of page" from Fan Ni provides folio conversions in the
hugetlb code.
* tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits)
mm: pcp: increase pcp->free_count threshold to trigger free_high
mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range()
mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page
mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page
mm/hugetlb: pass folio instead of page to unmap_ref_private()
memcg: objcg stock trylock without irq disabling
memcg: no stock lock for cpu hot-unplug
memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs
memcg: make count_memcg_events re-entrant safe against irqs
memcg: make mod_memcg_state re-entrant safe against irqs
memcg: move preempt disable to callers of memcg_rstat_updated
memcg: memcg_rstat_updated re-entrant safe against irqs
mm: khugepaged: decouple SHMEM and file folios' collapse
selftests/eventfd: correct test name and improve messages
alloc_tag: check mem_profiling_support in alloc_tag_init
Docs/damon: update titles and brief introductions to explain DAMOS
selftests/damon/_damon_sysfs: read tried regions directories in order
mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject()
mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat()
mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs
...
|
|
* for-next/mm:
arm64/boot: Disallow BSS exports to startup code
arm64/boot: Move global CPU override variables out of BSS
arm64/boot: Move init_pgdir[] and init_idmap_pgdir[] into __pi_ namespace
arm64: mm: Drop redundant check in pmd_trans_huge()
arm64/mm: Permit lazy_mmu_mode to be nested
arm64/mm: Disable barrier batching in interrupt contexts
arm64/mm: Batch barriers when updating kernel mappings
mm/vmalloc: Enter lazy mmu mode while manipulating vmalloc ptes
arm64/mm: Support huge pte-mapped pages in vmap
mm/vmalloc: Gracefully unmap huge ptes
mm/vmalloc: Warn on improper use of vunmap_range()
arm64/mm: Hoist barriers out of set_ptes_anysz() loop
arm64: hugetlb: Use __set_ptes_anysz() and __ptep_get_and_clear_anysz()
arm64/mm: Refactor __set_ptes() and __ptep_get_and_clear()
mm/page_table_check: Batch-check pmds/puds just like ptes
arm64: hugetlb: Refine tlb maintenance scope
arm64: hugetlb: Cleanup huge_pte size discovery mechanisms
arm64: pageattr: Explicitly bail out when changing permissions for vmalloc_huge mappings
arm64: Support ARM64_VA_BITS=52 when setting ARCH_MMAP_RND_BITS_MAX
arm64/mm: Remove randomization of the linear map
|
|
mov_q cannot really move PIE_E[0|1] macros into a general purpose register
as expected if those macro constants contain some 128 bit layout elements,
that are required for D128 page tables. The primary issue is that for D128,
PIE_E[0|1] are defined in terms of 128-bit types with shifting and masking,
which the assembler can't accommodate.
Instead pre-calculate these PIRE0_EL1/PIR_EL1 constants into asm-offsets.h
based PIE_E0_ASM/PIE_E1_ASM which can then be used in arch/arm64/mm/proc.S.
While here also drop PTE_MAYBE_NG/PTE_MAYBE_SHARED assembly overrides which
are not required any longer, as the compiler toolchains are smart enough to
compute both the PIE_[E0|E1]_ASM constants in all scenarios.
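A rough sketch of the approach (illustrative; the symbol names come from
the description above, the surrounding details do not):

/* arch/arm64/kernel/asm-offsets.c: emit the precomputed constants */
DEFINE(PIE_E0_ASM, PIE_E0);
DEFINE(PIE_E1_ASM, PIE_E1);

/* arch/arm64/mm/proc.S can then load them with a plain mov_q: */
	mov_q	x0, PIE_E0_ASM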
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Link: https://lore.kernel.org/r/20250429050511.1663235-1-anshuman.khandual@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Constructors for PUD/P4D-level pgtables were recently introduced. They
should be called for all pgtables; make sure they are called for special
kernel mappings created by __create_pgd_mapping() too.
Link: https://lkml.kernel.org/r/20250408095222.860601-12-kevin.brodsky@arm.com
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Linus Waleij <linus.walleij@linaro.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: <x86@kernel.org>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
TL;DR: always call the PTE/PMD ctor, passing the appropriate mm to skip
ptlock_init() if unneeded.
__create_pgd_mapping() is used for creating different kinds of mappings,
and may allocate page table pages if passed an allocator callback. There
are currently three such cases:
1. create_pgd_mapping(), which is used to create the EFI mapping
2. arch_add_memory()
3. map_entry_trampoline()
1. uses pgd_pgtable_alloc() as allocator callback, which calls the
PTE/PMD ctor, while 2. and 3. use __pgd_pgtable_alloc(), which does not.
The rationale is most likely that pgtables associated with init_mm do not
make use of split page table locks, and it is therefore unnecessary to
initialise them by calling the ctor. 2. operates on swapper_pg_dir so
the allocated pgtables are clearly associated with init_mm, this is
arguably the case for 3. too (the trampoline mapping is never modified so
ptlocks are anyway irrelevant). 1. corresponds to efi_mm so ptlocks need
to be initialised in that case.
We are now moving towards calling the ctor for all page tables, even those
associated with init_mm. pagetable_{pte,pmd}_ctor() have become aware of
the associated mm so that the ptlock initialisation can be skipped for
init_mm. This patch therefore amends the allocator callbacks so that the
PTE/PMD ctor are always called, with an appropriate mm pointer to avoid
unnecessary ptlock overhead.
Modifying the prototype of the allocator callbacks to take the mm and
propagating that pointer all the way down would be pretty invasive.
Instead:
* __pgd_pgtable_alloc() (cases 2. and 3. above) is replaced with
pgd_pgtable_alloc_init_mm(), resulting in the ctors being called with
&init_mm. This is the main functional change in this patch; the ptlock
still isn't initialised, but other ctor actions (e.g.
accounting-related) are now carried out for those allocated pgtables.
* pgd_pgtable_alloc() (case 1. above) is replaced with
pgd_pgtable_alloc_special_mm(), resulting in the ctors being called with
NULL as mm. No functional change here; NULL essentially means "not
init_mm", and the ptlock is still initialised.
__pgd_pgtable_alloc() is now the common implementation of those two
helpers. While at it we switch it to using pagetable_alloc() like
standard pgtable allocator functions and remove the comment regarding ctor
calls (ctors are now always expected to be called).
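A minimal sketch of the resulting split (hedged; the signatures are
assumptions, only the helper names come from the description above):

static phys_addr_t pgd_pgtable_alloc_init_mm(int shift)
{
	/* ctor called with &init_mm: accounting done, ptlock skipped */
	return __pgd_pgtable_alloc(&init_mm, shift);
}

static phys_addr_t pgd_pgtable_alloc_special_mm(int shift)
{
	/* ctor called with NULL ("not init_mm"): ptlock initialised */
	return __pgd_pgtable_alloc(NULL, shift);
}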
Link: https://lkml.kernel.org/r/20250408095222.860601-10-kevin.brodsky@arm.com
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: <x86@kernel.org>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit 90292aca9854 ("arm64: mm: use appropriate ctors for page tables")
introduced pgtable ctor calls in pgd_pgtable_alloc(). To identify the
pgtable level and call the appropriate ctor, the *_SHIFT value associated
with the pgtable level is used. However, those values do not
unambiguously identify a level, because if a given level is folded, the
*_SHIFT value will be equal to that of the upper level (e.g. PMD_SHIFT ==
PUD_SHIFT if PMD is folded).
As things stand, there is probably not much damage done by calling the
ctor for a different level, and ARCH_ENABLE_SPLIT_PMD_PTLOCK is only
selected if PMD isn't folded (so we don't needlessly initialise
pmd_ptlock). Still, this is pretty confusing, and it would get even more
confusing when adding ctor calls for the remaining levels.
Let's simplify all this by using an enum to identify the pgtable level
instead; this way folding becomes irrelevant. This is inspired by one of
the m68k pgtable allocators (arch/m68k/include/asm/motorola_pgalloc.h).
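A sketch of the idea; the enumerator names are illustrative:
  /*
   * Unlike the *_SHIFT values, which alias when a level is folded
   * (e.g. PMD_SHIFT == PUD_SHIFT), an explicit enum always identifies
   * the level unambiguously.
   */
  enum pgtable_type {
          TABLE_PTE,
          TABLE_PMD,
          TABLE_PUD,
          TABLE_P4D,
  };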
Link: https://lkml.kernel.org/r/20250408095222.860601-9-kevin.brodsky@arm.com
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: <x86@kernel.org>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Always call constructor for kernel page tables", v2.
There has been much confusion around exactly when page table
constructors/destructors (pagetable_*_[cd]tor) are supposed to be called.
They were initially introduced for user PTEs only (to support split page
table locks), then at the PMD level for the same purpose. Accounting was
added later on, starting at the PTE level and then moving to higher levels
(PMD, PUD). Finally, with my earlier series "Account page tables at all
levels" [1], the ctor/dtor is run for all levels, all the way to PGD.
I thought this was the end of the story, and it hopefully is for user
pgtables, but I was wrong as far as kernel pgtables are concerned. The current
situation there makes very little sense:
* At the PTE level, the ctor/dtor is not called (at least in the generic
implementation). Specific helpers are used for kernel pgtables at this
level (pte_{alloc,free}_kernel()) and those have never called the
ctor/dtor, most likely because they were initially irrelevant in the
kernel case.
* At all other levels, the ctor/dtor is normally called. This is
potentially wasteful at the PMD level (more on that later).
This series aims to ensure that the ctor/dtor is always called for kernel
pgtables, as it already is for user pgtables. Besides consistency, the
main motivation is to guarantee that ctor/dtor hooks are systematically
called; this makes it possible to insert hooks to protect page tables [2],
for instance. There is however an extra challenge: split locks are not
used for kernel pgtables, and it would therefore be wasteful to initialise
them (ptlock_init()).
It is worth clarifying exactly when split locks are used. They clearly
are for user pgtables, but as illustrated in commit 61444cde9170 ("ARM:
8591/1: mm: use fully constructed struct pages for EFI pgd allocations"),
they also are for special page tables like efi_mm. The one case where
split locks are definitely unused is pgtables owned by init_mm; this is
consistent with the behaviour of apply_to_pte_range().
The approach chosen in this series is therefore to pass the mm associated
with the pgtables being constructed to pagetable_{pte,pmd}_ctor() (patch 1),
and skip ptlock_init() if mm == &init_mm (patch 3 and 7). This makes it
possible to call the PTE ctor/dtor from pte_{alloc,free}_kernel() without
unintended consequences (patch 3). As a result the accounting functions
are now called at all levels for kernel pgtables, and split locks are
never initialised.
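A simplified sketch of the resulting ctor behaviour (the real helper also
handles accounting/statistics; treat this as an illustration of the init_mm
check, not the exact implementation):
  static inline bool pagetable_pte_ctor(struct mm_struct *mm,
                                        struct ptdesc *ptdesc)
  {
          /* Split PTLs are only useful for non-kernel page tables. */
          if (mm != &init_mm && !ptlock_init(ptdesc))
                  return false;

          __pagetable_ctor(ptdesc);       /* accounting etc. always runs */
          return true;
  }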
In configurations where ptlocks are dynamically allocated (32-bit,
PREEMPT_RT, etc.) and ARCH_ENABLE_SPLIT_PMD_PTLOCK is selected, this
series results in the removal of a kmem_cache allocation for every kernel
PMD. Additionally, for certain architectures that do not use
<asm-generic/pgalloc.h> such as s390, the same optimisation occurs at the
PTE level.
===
Things get more complicated when it comes to special pgtable allocators
(patch 8-12). All architectures need such allocators to create initial
kernel pgtables; we are not concerned with those as the ctor cannot be
called so early in the boot sequence. However, those allocators may also
be used later in the boot sequence or during normal operations. There are
two main use-cases:
1. Mapping EFI memory: efi_mm (arm, arm64, riscv)
2. arch_add_memory(): init_mm
The ctor is already explicitly run (at the PTE/PMD level) in the first
case, as required for pgtables that are not associated with init_mm.
However the same allocators may also be used for the second use-case (or
others), and this is where it gets messy. Patch 1 calls the ctor with
NULL as mm in those situations, as the actual mm isn't available.
Practically this means that ptlocks will be unconditionally initialised.
This is fine on arm - create_mapping_late() is only used for the EFI
mapping. On arm64, __create_pgd_mapping() is also used by
arch_add_memory(); patch 8/9/11 ensure that ctors are called at all levels
with the appropriate mm. The situation is similar on riscv, but
propagating the mm down to the ctor would require significant refactoring.
Since they are already called unconditionally, this series leaves riscv
no worse off - patch 10 adds comments to clarify the situation.
From a cursory look at other architectures implementing arch_add_memory(),
s390 and x86 may also need a similar treatment to add constructor calls.
This is to be taken care of in a future version or as a follow-up.
===
The complications in those special pgtable allocators beg the question:
does it really make sense to treat efi_mm and init_mm differently in e.g.
apply_to_pte_range()? Maybe what we really need is a way to tell if an mm
corresponds to user memory or not, and never use split locks for non-user
mm's. Feedback and suggestions welcome!
This patch (of 12):
In preparation for calling constructors for all kernel page tables while
eliding unnecessary ptlock initialisation, let's pass down the associated
mm to the PTE/PMD level ctors. (These are the two levels where ptlocks
are used.)
In most cases the mm is already around at the point of calling the ctor so
we simply pass it down. This is however not the case for special page
table allocators:
* arch/arm/mm/mmu.c
* arch/arm64/mm/mmu.c
* arch/riscv/mm/init.c
In those cases, the page tables being allocated are either for standard
kernel memory (init_mm) or special page directories, which may not be
associated with any mm. For now let's pass NULL as mm; this will be refined
where possible in future patches.
No functional change in this patch.
Link: https://lore.kernel.org/linux-mm/20250103184415.2744423-1-kevin.brodsky@arm.com/ [1]
Link: https://lore.kernel.org/linux-hardening/20250203101839.1223008-1-kevin.brodsky@arm.com/ [2]
Link: https://lkml.kernel.org/r/20250408095222.860601-1-kevin.brodsky@arm.com
Link: https://lkml.kernel.org/r/20250408095222.860601-2-kevin.brodsky@arm.com
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> [s390]
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: <x86@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Define a ptdesc_t type which describes the basic page table descriptor
layout on the arm64 platform. All level-specific pxxval_t descriptors are
subsequently derived from ptdesc_t, thus establishing a common underlying
format which is also appropriate for page table entries, masks and
protection values used at all page table levels.
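For illustration, the relationship could look like this (the exact header
location is an assumption):
  typedef u64 ptdesc_t;           /* basic 64-bit descriptor layout */

  typedef ptdesc_t pteval_t;
  typedef ptdesc_t pmdval_t;
  typedef ptdesc_t pudval_t;
  typedef ptdesc_t p4dval_t;
  typedef ptdesc_t pgdval_t;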
Link: https://lkml.kernel.org/r/20250407053113.746295-4-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Suggested-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/ptdump: Drop assumption that pxd_val() is u64", v2.
The last argument passed down in note_page() is a u64, assuming that the
value returned by pxd_val() (at all page table levels) is 64 bit - which
might no longer be the case once D128 page tables are enabled on the arm64
platform. Besides, pxd_val() is very platform specific and its type should
not be assumed in generic MM. A similar problem exists for
effective_prot(), although it is restricted to the x86 platform.
This series splits note_page() and effective_prot() into individual page
table level specific callbacks which accept the corresponding pxd_t page
table entry as an argument instead; all subscribing platforms can then
derive pxd_val() from the table entries as required and proceed as before.
Define a ptdesc_t type which describes the basic page table descriptor
layout on the arm64 platform. All level-specific pxxval_t descriptors are
subsequently derived from ptdesc_t, thus establishing a common underlying
format which is also appropriate for page table entries, masks and
protection values used at all page table levels.
This patch (of 3):
The last argument passed down in note_page() is a u64, assuming that the
value returned by pxd_val() (at all page table levels) is 64 bit - which
might no longer be the case once D128 page tables are enabled on the arm64
platform. Besides, pxd_val() is very platform specific and its type should
not be assumed in generic MM.
Split note_page() into individual page table level specific callbacks which
accept the corresponding pxd_t argument instead; subscribing platforms then
derive pxd_val() from the entries as required and proceed as before.
Also add a note_page_flush() callback for flushing the last page table
page that was being handled earlier via level = -1.
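A sketch of the reworked callback interface; the field layout is assumed
from the description above (the single note_page(st, addr, level, u64 val)
hook becomes typed, per-level hooks plus a flush hook):
  struct ptdump_state {
          void (*note_page_pte)(struct ptdump_state *st, unsigned long addr, pte_t pte);
          void (*note_page_pmd)(struct ptdump_state *st, unsigned long addr, pmd_t pmd);
          void (*note_page_pud)(struct ptdump_state *st, unsigned long addr, pud_t pud);
          void (*note_page_p4d)(struct ptdump_state *st, unsigned long addr, p4d_t p4d);
          void (*note_page_pgd)(struct ptdump_state *st, unsigned long addr, pgd_t pgd);
          void (*note_page_flush)(struct ptdump_state *st); /* replaces level = -1 */
          const struct ptdump_range *range;
  };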
Link: https://lkml.kernel.org/r/20250407053113.746295-1-anshuman.khandual@arm.com
Link: https://lkml.kernel.org/r/20250407053113.746295-2-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Implement the required arch functions to enable use of contpte in the
vmap when VM_ALLOW_HUGE_VMAP is specified. This speeds up vmap
operations, since only one DSB and ISB needs to be issued per contpte
block instead of per pte. It also reduces TLB pressure, since only a
single TLB entry is needed for the whole contpte block.
Since vmap uses set_huge_pte_at() to set the contpte, that API is now
used for kernel mappings for the first time. In the vmap case we never
expect it to be called to modify a valid mapping, so clear_flush()
should never be invoked; it is still wise to make it robust for the
kernel case, so amend the tlb flush function to handle an mm that
refers to kernel space.
Tested with vmalloc performance selftests:
# kself/mm/test_vmalloc.sh \
        run_test_mask=1 \
        test_repeat_count=5 \
        nr_pages=256 \
        test_loop_count=100000 \
        use_huge=1
Duration reduced from 1274243 usec to 1083553 usec on Apple M2, a 15%
reduction in time taken.
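As a usage note, any allocation made through vmalloc_huge() (which sets
VM_ALLOW_HUGE_VMAP) can benefit; a minimal example:
  #include <linux/sizes.h>
  #include <linux/vmalloc.h>

  static void *alloc_large_buffer(void)
  {
          /* arm64 may now map this with contpte blocks: fewer barriers at
           * map time and fewer TLB entries at run time. */
          return vmalloc_huge(SZ_16M, GFP_KERNEL);
  }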
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: Luiz Capitulino <luizcap@redhat.com>
Link: https://lore.kernel.org/r/20250422081822.1836315-10-ryan.roberts@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Refactor the huge_pte helpers to use the new common __set_ptes_anysz()
and __ptep_get_and_clear_anysz() APIs.
This provides two benefits. First, when page_table_check=on, hugetlb is
now properly/fully checked; previously only the first page of a hugetlb
folio was checked. Second, instead of having to call __set_ptes(nr=1)
for each pte in a loop, the whole contiguous batch can now be set in one
go, which enables some efficiencies and cleans up the code.
One detail to note is that huge_ptep_clear_flush() was previously
calling ptep_clear_flush() for a non-contiguous pte (i.e. a pud or pmd
block mapping). This has a couple of disadvantages. First,
ptep_clear_flush() calls ptep_get_and_clear(), which transparently
handles contpte; given we only call it for non-contiguous ptes, that
would be safe but a waste of effort, and it is preferable to go straight
to the layer below. More problematic, however, is that
ptep_get_and_clear() is for PAGE_SIZE entries, so it calls
page_table_check_pte_clear() and would not clear the whole hugetlb
folio. So let's stop special-casing the non-cont case and just rely on
get_clear_contig_flush() to do the right thing for non-cont entries.
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Tested-by: Luiz Capitulino <luizcap@redhat.com>
Link: https://lore.kernel.org/r/20250422081822.1836315-6-ryan.roberts@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
When operating on contiguous blocks of ptes (or pmds) for some hugetlb
sizes, we must honour break-before-make requirements and clear down the
block to invalid state in the pgtable, then invalidate the relevant tlb
entries before making the pgtable entries valid again.
However, the tlb maintenance is currently always done assuming the worst
case stride (PAGE_SIZE), last_level (false) and tlb_level
(TLBI_TTL_UNKNOWN). We can do much better with the hinting: in reality,
we know the stride from the huge_pte pgsize, we are always operating
only on the last level, and we always know the tlb_level, again based on
pgsize. So let's start providing these hints.
Additionally, avoid tlb maintenance in set_huge_pte_at().
Break-before-make is only required if we are transitioning the
contiguous pte block from valid -> valid. So let's elide the
clear-and-flush ("break") if the pte range was previously invalid.
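A sketch of the elision logic; the helper name and exact form are
illustrative, not the arm64 implementation:
  /* Only "break" (clear + TLB invalidate) if any entry in the contiguous
   * range is currently valid; a previously-invalid range can simply be
   * written. */
  static bool huge_pte_range_valid(pte_t *ptep, unsigned int ncontig)
  {
          unsigned int i;

          for (i = 0; i < ncontig; i++)
                  if (pte_valid(__ptep_get(ptep + i)))
                          return true;

          return false;
  }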
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: Luiz Capitulino <luizcap@redhat.com>
Link: https://lore.kernel.org/r/20250422081822.1836315-3-ryan.roberts@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Not all huge_pte helper APIs explicitly provide the size of the
huge_pte. So the helpers have to depend on various methods to determine
the size of the huge_pte. Some of these methods are dubious.
Let's clean up the code to use preferred methods and retire the dubious
ones. The options in order of preference:
- If size is provided as parameter, use it together with
num_contig_ptes(). This is explicit and works for both present and
non-present ptes.
- If vma is provided as a parameter, retrieve size via
huge_page_size(hstate_vma(vma)) and use it together with
num_contig_ptes(). This is explicit and works for both present and
non-present ptes.
- If the pte is present and contiguous, use find_num_contig() to walk
the pgtable to find the level and infer the number of ptes from
level. Only works for *present* ptes.
- If the pte is present and not contiguous, you can infer from this
that only 1 pte needs to be operated on. This is ok if you don't care
about the absolute size and just want to know the number of ptes.
- NEVER rely on resolving the PFN of a present pte to a folio and
getting the folio's size. This is fragile at best, because there is
nothing to stop the core-mm from allocating a folio twice as big as
the huge_pte then mapping it across 2 consecutive huge_ptes. Or just
partially mapping it.
Where we require that the pte is present, add warnings if it is not present.
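An illustrative sketch of the preferred pattern (first bullet above),
assuming the usual arm64 num_contig_ptes() helper; not an exact excerpt:
  static void clear_contig_ptes(struct mm_struct *mm, unsigned long addr,
                                pte_t *ptep, unsigned long sz)
  {
          size_t pgsize;
          int i, ncontig = num_contig_ptes(sz, &pgsize);

          /* Explicit size: works for present and non-present ptes alike. */
          for (i = 0; i < ncontig; i++, addr += pgsize, ptep++)
                  __ptep_get_and_clear(mm, addr, ptep);
  }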
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: Luiz Capitulino <luizcap@redhat.com>
Link: https://lore.kernel.org/r/20250422081822.1836315-2-ryan.roberts@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
vmalloc_huge mappings
arm64 uses apply_to_page_range() to change permissions for kernel vmalloc
mappings, but that function does not support changing permissions for block
mappings: it will change permissions until it encounters a block mapping
and then bail out with a warning. Since there are no reports of this
triggering, it implies that there are currently no cases of code doing a
vmalloc_huge() followed by a partial permission change. But this is a
footgun waiting to go off, so let's detect it early and avoid the
possibility of permissions being left in an intermediate state: explicitly
disallow changing permissions for VM_ALLOW_HUGE_VMAP mappings.
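A sketch of the check as a hypothetical helper; its placement inside
arm64's permission-change path and the exact return value are assumptions:
  static int check_vmalloc_perm_change(unsigned long addr)
  {
          struct vm_struct *area = find_vm_area((void *)addr);

          if (!area || !(area->flags & VM_ALLOC))
                  return -EINVAL;

          /* Block (huge) mappings cannot have their permissions changed
           * safely via apply_to_page_range(), so refuse outright. */
          if (area->flags & VM_ALLOW_HUGE_VMAP)
                  return -EINVAL;

          return 0;
  }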
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20250403052844.61818-1-dev.jain@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Since commit
97d6786e0669 ("arm64: mm: account for hotplug memory when randomizing the linear region")
the decision whether or not to randomize the placement of the system's
DRAM inside the linear map is based on the capabilities of the CPU
rather than how much memory is present at boot time. This change was
necessary because memory hotplug may result in DRAM appearing in places
that are not covered by the linear region at all (and therefore
unusable) if the decision is solely based on the memory map at boot.
In the Android GKI kernel, which requires support for memory hotplug,
and is built with a reduced virtual address space of only 39 bits wide,
randomization of the linear map never happens in practice as a result.
And even on arm64 kernels built with support for 48 bit virtual
addressing, the wider PArange of recent CPUs means that linear map
randomization is slowly becoming a feature that only works on systems
that will soon be obsolete.
So let's just remove this feature. We can always bring it back in an
improved form if there is a real need for it.
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250318134949.3194334-2-ardb+git@google.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- Fix max_pfn calculation when hotplugging memory so that it never
decreases
- Fix dereference of unused source register in the MOPS SET operation
fault handling
- Fix a NULL function pointer call in do_compat_alignment_fixup() when
32-bit user space does an unaligned LDREX/STREX
- Add the HiSilicon HIP09 processor to the Spectre-BHB affected CPUs
- Drop unused pud accessors (special/mkspecial)
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: Don't call NULL in do_compat_alignment_fixup()
arm64: Add support for HIP09 Spectre-BHB mitigation
arm64: mm: Drop dead code for pud special bit handling
arm64: mops: Do not dereference src reg for a set operation
arm64: mm: Correct the update of max_pfn
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- The series "powerpc/crash: use generic crashkernel reservation" from
Sourabh Jain changes powerpc's kexec code to use more of the generic
layers.
- The series "get_maintainer: report subsystem status separately" from
Vlastimil Babka makes some long-requested improvements to the
get_maintainer output.
- The series "ucount: Simplify refcounting with rcuref_t" from
Sebastian Siewior cleans up and optimizes the refcounting in the
ucount code.
- The series "reboot: support runtime configuration of emergency
hw_protection action" from Ahmad Fatoum improves the ability for a
driver to perform an emergency system shutdown or reboot.
- The series "Converge on using secs_to_jiffies() part two" from Easwar
Hariharan performs further migrations from msecs_to_jiffies() to
secs_to_jiffies().
- The series "lib/interval_tree: add some test cases and cleanup" from
Wei Yang permits more userspace testing of kernel library code, adds
some more tests and performs some cleanups.
- The series "hung_task: Dump the blocking task stacktrace" from Masami
Hiramatsu arranges for the hung_task detector to dump the stack of
the blocking task and not just that of the blocked task.
- The series "resource: Split and use DEFINE_RES*() macros" from Andy
Shevchenko provides some cleanups to the resource definition macros.
- Plus the usual shower of singleton patches - please see the
individual changelogs for details.
* tag 'mm-nonmm-stable-2025-03-30-18-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits)
mailmap: consolidate email addresses of Alexander Sverdlin
fs/procfs: fix the comment above proc_pid_wchan()
relay: use kasprintf() instead of fixed buffer formatting
resource: replace open coded variant of DEFINE_RES()
resource: replace open coded variants of DEFINE_RES_*_NAMED()
resource: replace open coded variant of DEFINE_RES_NAMED_DESC()
resource: split DEFINE_RES_NAMED_DESC() out of DEFINE_RES_NAMED()
samples: add hung_task detector mutex blocking sample
hung_task: show the blocker task if the task is hung on mutex
kexec_core: accept unaccepted kexec segments' destination addresses
watchdog/perf: optimize bytes copied and remove manual NUL-termination
lib/interval_tree: fix the comment of interval_tree_span_iter_next_gap()
lib/interval_tree: skip the check before go to the right subtree
lib/interval_tree: add test case for span iteration
lib/interval_tree: add test case for interval_tree_iter_xxx() helpers
lib/rbtree: add random seed
lib/rbtree: split tests
lib/rbtree: enable userland test suite for rbtree related data structure
checkpatch: describe --min-conf-desc-length
scripts/gdb/symbols: determine KASLR offset on s390
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- The series "Enable strict percpu address space checks" from Uros
Bizjak uses x86 named address space qualifiers to provide
compile-time checking of percpu area accesses.
This has caused a small amount of fallout - two or three issues were
reported. In all cases the calling code was found to be incorrect.
- The series "Some cleanup for memcg" from Chen Ridong implements some
relatively minor cleanups for the memcontrol code.
- The series "mm: fixes for device-exclusive entries (hmm)" from David
Hildenbrand fixes a boatload of issues which David found when using
device-exclusive PTE entries when THP is enabled. More work is
needed, but this makes things better - our own HMM selftests now
succeed.
- The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed
removes the z3fold and zbud implementations. They have been deprecated
for half a year and nobody has complained.
- The series "mm: further simplify VMA merge operation" from Lorenzo
Stoakes implements numerous simplifications in this area. No runtime
effects are anticipated.
- The series "mm/madvise: remove redundant mmap_lock operations from
process_madvise()" from SeongJae Park rationalizes the locking in the
madvise() implementation. Performance gains of 20-25% were observed
in one MADV_DONTNEED microbenchmark.
- The series "Tiny cleanup and improvements about SWAP code" from
Baoquan He contains a number of touchups to issues which Baoquan
noticed when working on the swap code.
- The series "mm: kmemleak: Usability improvements" from Catalin
Marinas implements a couple of improvements to the kmemleak
user-visible output.
- The series "mm/damon/paddr: fix large folios access and schemes
handling" from Usama Arif provides a couple of fixes for DAMON's
handling of large folios.
- The series "mm/damon/core: fix wrong and/or useless damos_walk()
behaviors" from SeongJae Park fixes a few issues with the accuracy of
kdamond's walking of DAMON regions.
- The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo
Stoakes changes the interaction between framebuffer deferred-io and
core MM. No functional changes are anticipated - this is preparatory
work for the future removal of page structure fields.
- The series "mm/damon: add support for hugepage_size DAMOS filter"
from Usama Arif adds a DAMOS filter which permits the filtering by
huge page sizes.
- The series "mm: permit guard regions for file-backed/shmem mappings"
from Lorenzo Stoakes extends the guard region feature from its
present "anon mappings only" state. The feature now covers shmem and
file-backed mappings.
- The series "mm: batched unmap lazyfree large folios during
reclamation" from Barry Song cleans up and speeds up the unmapping
for pte-mapped large folios.
- The series "reimplement per-vma lock as a refcount" from Suren
Baghdasaryan puts the vm_lock back into the vma. Our reasons for
pulling it out were largely bogus and that change made the code more
messy. This patchset provides small (0-10%) improvements on one
microbenchmark.
- The series "Docs/mm/damon: misc DAMOS filters documentation fixes and
improves" from SeongJae Park does some maintenance work on the DAMON
docs.
- The series "hugetlb/CMA improvements for large systems" from Frank
van der Linden addresses a pile of issues which have been observed
when using CMA on large machines.
- The series "mm/damon: introduce DAMOS filter type for unmapped pages"
from SeongJae Park enables users of DAMON/DAMOS to filter by the
page's mapped/unmapped status.
- The series "zsmalloc/zram: there be preemption" from Sergey
Senozhatsky teaches zram to run its compression and decompression
operations preemptibly.
- The series "selftests/mm: Some cleanups from trying to run them" from
Brendan Jackman fixes a pile of unrelated issues which Brendan
encountered while running our selftests.
- The series "fs/proc/task_mmu: add guard region bit to pagemap" from
Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
determine whether a particular page is a guard page.
- The series "mm, swap: remove swap slot cache" from Kairui Song
removes the swap slot cache from the allocation path - it simply
wasn't being effective.
- The series "mm: cleanups for device-exclusive entries (hmm)" from
David Hildenbrand implements a number of unrelated cleanups in this
code.
- The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual
implements a number of preparatory cleanups to the GENERIC_PTDUMP
Kconfig logic.
- The series "mm/damon: auto-tune aggregation interval" from SeongJae
Park implements a feedback-driven automatic tuning feature for
DAMON's aggregation interval tuning.
- The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in
powerpc, sparc and x86 lazy MMU implementations. Ryan did this in
preparation for implementing lazy mmu mode for arm64 to optimize
vmalloc.
- The series "mm/page_alloc: Some clarifications for migratetype
fallback" from Brendan Jackman reworks some commentary to make the
code easier to follow.
- The series "page_counter cleanup and size reduction" from Shakeel
Butt cleans up the page_counter code and fixes a size increase which
we accidentally added late last year.
- The series "Add a command line option that enables control of how
many threads should be used to allocate huge pages" from Thomas
Prescher does that. It allows the careful operator to significantly
reduce boot time by tuning the parallelization of huge page
initialization.
- The series "Fix calculations in trace_balance_dirty_pages() for cgwb"
from Tang Yizhou fixes the tracing output from the dirty page
balancing code.
- The series "mm/damon: make allow filters after reject filters useful
and intuitive" from SeongJae Park improves the handling of allow and
reject filters. Behaviour is made more consistent and the documentation
is updated accordingly.
- The series "Switch zswap to object read/write APIs" from Yosry Ahmed
updates zswap to the new object read/write APIs and thus permits the
removal of some legacy code from zpool and zsmalloc.
- The series "Some trivial cleanups for shmem" from Baolin Wang does as
it claims.
- The series "fs/dax: Fix ZONE_DEVICE page reference counts" from
Alistair Popple regularizes the weird ZONE_DEVICE page refcount
handling in DAX, permitting the removal of a number of special-case
checks.
- The series "refactor mremap and fix bug" from Lorenzo Stoakes is a
preparatory refactoring and cleanup of the mremap() code.
- The series "mm: MM owner tracking for large folios (!hugetlb) +
CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
which we determine whether a large folio is known to be mapped
exclusively into a single MM.
- The series "mm/damon: add sysfs dirs for managing DAMOS filters based
on handling layers" from SeongJae Park adds a couple of new sysfs
directories to ease the management of DAMON/DAMOS filters.
- The series "arch, mm: reduce code duplication in mem_init()" from
Mike Rapoport consolidates many per-arch implementations of
mem_init() into generic code, where that is practical.
- The series "mm/damon/sysfs: commit parameters online via
damon_call()" from SeongJae Park continues the cleaning up of sysfs
access to DAMON internal data.
- The series "mm: page_ext: Introduce new iteration API" from Luiz
Capitulino reworks the page_ext initialization to fix a boot-time
crash which was observed with an unusual combination of compile and
cmdline options.
- The series "Buddy allocator like (or non-uniform) folio split" from
Zi Yan reworks the code to split a folio into smaller folios. The
main benefit is lessened memory consumption: fewer post-split folios
are generated.
- The series "Minimize xa_node allocation during xarry split" from Zi
Yan reduces the number of xarray xa_nodes which are generated during
an xarray split.
- The series "drivers/base/memory: Two cleanups" from Gavin Shan
performs some maintenance work on the drivers/base/memory code.
- The series "Add tracepoints for lowmem reserves, watermarks and
totalreserve_pages" from Martin Liu adds some more tracepoints to the
page allocator code.
- The series "mm/madvise: cleanup requests validations and
classifications" from SeongJae Park cleans up some warts which
SeongJae observed during his earlier madvise work.
- The series "mm/hwpoison: Fix regressions in memory failure handling"
from Shuai Xue addresses two quite serious regressions which Shuai
has observed in the memory-failure implementation.
- The series "mm: reliable huge page allocator" from Johannes Weiner
makes huge page allocations cheaper and more reliable by reducing
fragmentation.
- The series "Minor memcg cleanups & prep for memdescs" from Matthew
Wilcox is preparatory work for the future implementation of memdescs.
- The series "track memory used by balloon drivers" from Nico Pache
introduces a way to track memory used by our various balloon drivers.
- The series "mm/damon: introduce DAMOS filter type for active pages"
from Nhat Pham permits users to filter for active/inactive pages,
separately for file and anon pages.
- The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia
separates the proactive reclaim statistics from the direct reclaim
statistics.
- The series "mm/vmscan: don't try to reclaim hwpoison folio" from
Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim
code.
* tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits)
mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex()
x86/mm: restore early initialization of high_memory for 32-bits
mm/vmscan: don't try to reclaim hwpoison folio
mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper
cgroup: docs: add pswpin and pswpout items in cgroup v2 doc
mm: vmscan: split proactive reclaim statistics from direct reclaim statistics
selftests/mm: speed up split_huge_page_test
selftests/mm: uffd-unit-tests support for hugepages > 2M
docs/mm/damon/design: document active DAMOS filter type
mm/damon: implement a new DAMOS filter type for active pages
fs/dax: don't disassociate zero page entries
MM documentation: add "Unaccepted" meminfo entry
selftests/mm: add commentary about 9pfs bugs
fork: use __vmalloc_node() for stack allocation
docs/mm: Physical Memory: Populate the "Zones" section
xen: balloon: update the NR_BALLOON_PAGES state
hv_balloon: update the NR_BALLOON_PAGES state
balloon_compaction: update the NR_BALLOON_PAGES state
meminfo: add a per node counter for balloon drivers
mm: remove references to folio in __memcg_kmem_uncharge_page()
...
|
|
Hotplugged memory can be smaller than the original memory. For example,
on my target:
root@genericarmv8:~# cat /sys/kernel/debug/memblock/memory
0: 0x0000000064005000..0x0000000064023fff 0 NOMAP
1: 0x0000000064400000..0x00000000647fffff 0 NOMAP
2: 0x0000000068000000..0x000000006fffffff 0 DRV_MNG
3: 0x0000000088800000..0x0000000094ffefff 0 NONE
4: 0x0000000094fff000..0x0000000094ffffff 0 NOMAP
max_pfn affects read_page_owner, so the update should first compare the
new value with the current one and then select the larger value for max_pfn.
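A sketch of the fix in arch_add_memory(); illustrative, but the key point
is the max() comparison rather than a plain assignment:
  /* Never let a smaller hotplugged range shrink max_pfn. */
  max_pfn = max(max_pfn, PFN_UP(start + size));
  max_low_pfn = max_pfn;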
Fixes: 8fac67ca236b ("arm64: mm: update max_pfn after memory hotplug")
Cc: <stable@vger.kernel.org> # 6.1.x
Signed-off-by: Zhenhua Huang <quic_zhenhuah@quicinc.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20250321070019.1271859-1-quic_zhenhuah@quicinc.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
"Nothing major this time around.
Apart from the usual perf/PMU updates, some page table cleanups, the
notable features are average CPU frequency based on the AMUv1
counters, CONFIG_HOTPLUG_SMT and MOPS instructions (memcpy/memset) in
the uaccess routines.
Perf and PMUs:
- Support for the 'Rainier' CPU PMU from Arm
- Preparatory driver changes and cleanups that pave the way for BRBE
support
- Support for partial virtualisation of the Apple-M1 PMU
- Support for the second event filter in Arm CSPMU designs
- Minor fixes and cleanups (CMN and DWC PMUs)
- Enable EL2 requirements for FEAT_PMUv3p9
Power, CPU topology:
- Support for AMUv1-based average CPU frequency
- Run-time SMT control wired up for arm64 (CONFIG_HOTPLUG_SMT). It
adds a generic topology_is_primary_thread() function overridden by
x86 and powerpc
New(ish) features:
- MOPS (memcpy/memset) support for the uaccess routines
Security/confidential compute:
- Fix the DMA address for devices used in Realms with Arm CCA. The
CCA architecture uses the address bit to differentiate between
shared and private addresses
- Spectre-BHB: assume CPUs that Linux doesn't know about are vulnerable
by default
Memory management clean-ups:
- Drop the P*D_TABLE_BIT definition in preparation for 128-bit PTEs
- Some minor page table accessor clean-ups
- PIE/POE (permission indirection/overlay) helpers clean-up
Kselftests:
- MTE: skip hugetlb tests if MTE is not supported on such mappings
and use the correct naming for sync/async tag checking modes
Miscellaneous:
- Add a PKEY_UNRESTRICTED definition as 0 to uapi (toolchain people
request)
- Sysreg updates for new register fields
- CPU type info for some Qualcomm Kryo cores"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (72 commits)
arm64: mm: Don't use %pK through printk
perf/arm_cspmu: Fix missing io.h include
arm64: errata: Add newer ARM cores to the spectre_bhb_loop_affected() lists
arm64: cputype: Add MIDR_CORTEX_A76AE
arm64: errata: Add KRYO 2XX/3XX/4XX silver cores to Spectre BHB safe list
arm64: errata: Assume that unknown CPUs _are_ vulnerable to Spectre BHB
arm64: errata: Add QCOM_KRYO_4XX_GOLD to the spectre_bhb_k24_list
arm64/sysreg: Enforce whole word match for open/close tokens
arm64/sysreg: Fix unbalanced closing block
arm64: Kconfig: Enable HOTPLUG_SMT
arm64: topology: Support SMT control on ACPI based system
arch_topology: Support SMT control for OF based system
cpu/SMT: Provide a default topology_is_primary_thread()
arm64/mm: Define PTDESC_ORDER
perf/arm_cspmu: Add PMEVFILT2R support
perf/arm_cspmu: Generalise event filtering
perf/arm_cspmu: Move register definitons to header
arm64/kernel: Always use level 2 or higher for early mappings
arm64/mm: Drop PXD_TABLE_BIT
arm64/mm: Check pmd_table() in pmd_trans_huge()
...
|
|
'for-next/sysreg', 'for-next/misc', 'for-next/pgtable-cleanups', 'for-next/kselftest', 'for-next/uaccess-mops', 'for-next/pie-poe-cleanup', 'for-next/cputype-kryo', 'for-next/cca-dma-address', 'for-next/drop-pxd_table_bit' and 'for-next/spectre-bhb-assume-vulnerable', remote-tracking branch 'arm64/for-next/perf' into for-next/core
* arm64/for-next/perf:
perf/arm_cspmu: Fix missing io.h include
perf/arm_cspmu: Add PMEVFILT2R support
perf/arm_cspmu: Generalise event filtering
perf/arm_cspmu: Move register definitons to header
drivers/perf: apple_m1: Support host/guest event filtering
drivers/perf: apple_m1: Refactor event select/filter configuration
perf/dwc_pcie: fix duplicate pci_dev devices
perf/dwc_pcie: fix some unreleased resources
perf/arm-cmn: Minor event type housekeeping
perf: arm_pmu: Move PMUv3-specific data
perf: apple_m1: Don't disable counter in m1_pmu_enable_event()
perf: arm_v7_pmu: Don't disable counter in (armv7|krait_|scorpion_)pmu_enable_event()
perf: arm_v7_pmu: Drop obvious comments for enabling/disabling counters and interrupts
perf: arm_pmuv3: Don't disable counter in armv8pmu_enable_event()
perf: arm_pmu: Don't disable counter in armpmu_add()
perf: arm_pmuv3: Call kvm_vcpu_pmu_resync_el0() before enabling counters
perf: arm_pmuv3: Add support for ARM Rainier PMU
* for-next/amuv1-avg-freq:
: Add support for AArch64 AMUv1-based average freq
arm64: Utilize for_each_cpu_wrap for reference lookup
arm64: Update AMU-based freq scale factor on entering idle
arm64: Provide an AMU-based version of arch_freq_get_on_cpu
cpufreq: Introduce an optional cpuinfo_avg_freq sysfs entry
cpufreq: Allow arch_freq_get_on_cpu to return an error
arch_topology: init capacity_freq_ref to 0
* for-next/pkey_unrestricted:
: mm/pkey: Add PKEY_UNRESTRICTED macro
selftest/powerpc/mm/pkey: fix build-break introduced by commit 00894c3fc917
selftests/powerpc: Use PKEY_UNRESTRICTED macro
selftests/mm: Use PKEY_UNRESTRICTED macro
mm/pkey: Add PKEY_UNRESTRICTED macro
* for-next/sysreg:
: arm64 sysreg updates
arm64/sysreg: Enforce whole word match for open/close tokens
arm64/sysreg: Fix unbalanced closing block
arm64/sysreg: Add register fields for HFGWTR2_EL2
arm64/sysreg: Add register fields for HFGRTR2_EL2
arm64/sysreg: Add register fields for HFGITR2_EL2
arm64/sysreg: Add register fields for HDFGWTR2_EL2
arm64/sysreg: Add register fields for HDFGRTR2_EL2
arm64/sysreg: Update register fields for ID_AA64MMFR0_EL1
* for-next/misc:
: Miscellaneous arm64 patches
arm64: mm: Don't use %pK through printk
arm64/fpsimd: Remove unused declaration fpsimd_kvm_prepare()
* for-next/pgtable-cleanups:
: arm64 pgtable accessors cleanup
arm64/mm: Define PTDESC_ORDER
arm64/kernel: Always use level 2 or higher for early mappings
arm64/hugetlb: Consistently use pud_sect_supported()
arm64/mm: Convert __pte_to_phys() and __phys_to_pte_val() as functions
* for-next/kselftest:
: arm64 kselftest updates
kselftest/arm64: mte: Skip the hugetlb tests if MTE not supported on such mappings
kselftest/arm64: mte: Use the correct naming for tag check modes in check_hugetlb_options.c
* for-next/uaccess-mops:
: Implement the uaccess memory copy/set using MOPS instructions
arm64: lib: Use MOPS for usercopy routines
arm64: mm: Handle PAN faults on uaccess CPY* instructions
arm64: extable: Add fixup handling for uaccess CPY* instructions
* for-next/pie-poe-cleanup:
: PIE/POE helpers cleanup
arm64/sysreg: Move POR_EL0_INIT to asm/por.h
arm64/sysreg: Rename POE_RXW to POE_RWX
arm64/sysreg: Improve PIR/POR helpers
* for-next/cputype-kryo:
: Add cputype info for some Qualcomm Kryo cores
arm64: cputype: Add comments about Qualcomm Kryo 5XX and 6XX cores
arm64: cputype: Add QCOM_CPU_PART_KRYO_3XX_GOLD
* for-next/cca-dma-address:
: Fix DMA address for devices used in realms with Arm CCA
arm64: realm: Use aliased addresses for device DMA to shared buffers
dma: Introduce generic dma_addr_*crypted helpers
dma: Fix encryption bit clearing for dma_to_phys
* for-next/drop-pxd_table_bit:
: Drop the arm64 PXD_TABLE_BIT (clean-up in preparation for 128-bit PTEs)
arm64/mm: Drop PXD_TABLE_BIT
arm64/mm: Check pmd_table() in pmd_trans_huge()
arm64/mm: Check PUD_TYPE_TABLE in pud_bad()
arm64/mm: Check PXD_TYPE_TABLE in [p4d|pgd]_bad()
arm64/mm: Clear PXX_TYPE_MASK and set PXD_TYPE_SECT in [pmd|pud]_mkhuge()
arm64/mm: Clear PXX_TYPE_MASK in mk_[pmd|pud]_sect_prot()
arm64/ptdump: Test PMD_TYPE_MASK for block mapping
KVM: arm64: ptdump: Test PMD_TYPE_MASK for block mapping
* for-next/spectre-bhb-assume-vulnerable:
: Rework Spectre BHB mitigations to not assume "safe"
arm64: errata: Add newer ARM cores to the spectre_bhb_loop_affected() lists
arm64: cputype: Add MIDR_CORTEX_A76AE
arm64: errata: Add KRYO 2XX/3XX/4XX silver cores to Spectre BHB safe list
arm64: errata: Assume that unknown CPUs _are_ vulnerable to Spectre BHB
arm64: errata: Add QCOM_KRYO_4XX_GOLD to the spectre_bhb_k24_list
|
|
Restricted pointers ("%pK") are not meant to be used through printk().
They can unintentionally expose security-sensitive raw pointer values.
Use regular pointer formatting instead.
Link: https://lore.kernel.org/lkml/20250113171731-dc10e3c1-da64-4af0-b767-7c7070468023@linutronix.de/
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Link: https://lore.kernel.org/r/20250217-restricted-pointers-arm64-v1-1-14bb1f516b01@linutronix.de
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
The point where the memory is released from memblock to the buddy
allocator is hidden inside arch-specific mem_init()s, and the call to
memblock_free_all() is needlessly duplicated in every architecture. After
the introduction of the arch_mm_preinit() hook, the mem_init()
implementation on many architectures only contains the call to
memblock_free_all().
Pull memblock_free_all() call into mm_core_init() and drop mem_init() on
relevant architectures to make it more explicit where the free memory is
released from memblock to the buddy allocator and to reduce code
duplication in architecture specific code.
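A simplified sketch of the resulting flow in mm/mm_init.c; many
intermediate initialisations are elided and the weak default is an
assumption:
  void __init __weak arch_mm_preinit(void)
  {
          /* Default: nothing to do before the buddy allocator is up. */
  }

  void __init mm_core_init(void)
  {
          arch_mm_preinit();      /* arch setup that must precede the allocator */
          /* ... swiotlb, kmem_cache_init(), etc. elided ... */
          memblock_free_all();    /* hand memblock memory to the buddy allocator */
          /* ... remaining core-mm initialisation ... */
  }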
Link: https://lkml.kernel.org/r/20250313135003.836600-14-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86]
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Tested-by: Mark Brown <broonie@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Guo Ren (csky) <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Currently, the implementation of mem_init() in every architecture consists
of one or more of the following:
* initializations that must run before the page allocator is active, for
instance swiotlb_init()
* a call to memblock_free_all() to release all the memory to the buddy
allocator
* initializations that must run after the page allocator is ready and there
is no arch-specific hook other than mem_init() for that, like for example
register_page_bootmem_info() in x86 and sparc64 or simply setting
mem_init_done = 1 in several architectures
* a bunch of semi-related stuff that apparently had no better place to
live, for example a ton of BUILD_BUG_ON()s in parisc.
Introduce arch_mm_preinit() that will be the first thing called from
mm_core_init(). On architectures that have initializations that must happen
before the page allocator is ready, move those into arch_mm_preinit() along
with the code that does not depend on ordering with page allocator setup.
On several architectures this results in the reduction of mem_init() to a
single call to memblock_free_all(), which allows its consolidation next.
Link: https://lkml.kernel.org/r/20250313135003.836600-13-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86]
Tested-by: Mark Brown <broonie@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Guo Ren (csky) <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
high_memory defines the upper bound on the directly mapped memory. This bound
is defined by the beginning of ZONE_HIGHMEM when a system has high memory
and by the end of memory otherwise.
All this is known to generic memory management initialization code that
can set high_memory while initializing core mm structures.
Add a generic calculation of high_memory to free_area_init() and remove
per-architecture calculation except for the architectures that set and use
high_memory earlier than that.
Link: https://lkml.kernel.org/r/20250313135003.836600-11-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86]
Tested-by: Mark Brown <broonie@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Guo Ren (csky) <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Platforms subscribe to the generic ptdump implementation via GENERIC_PTDUMP,
but generic ptdump gets enabled via PTDUMP_CORE. This combination of configs
is confusing, as the names sound very similar and do not differentiate
between a platform's feature subscription and feature enablement for ptdump.
Rename the configs to ARCH_HAS_PTDUMP and PTDUMP, making this clearer and
improving readability.
Link: https://lkml.kernel.org/r/20250226122404.1927473-6-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> (powerpc)
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Cc: Will Deacon <will@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Steven Price <steven.price@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The cmdline argument is not used in reserve_crashkernel_generic(), so remove
it. All the callers have been updated correspondingly.
No functional change intended.
Link: https://lkml.kernel.org/r/20250131113830.925179-3-sourabhjain@linux.ibm.com
Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
ioremap_prot() currently accepts the pgprot_val parameter as an unsigned
long, thus implicitly assuming that pgprot_val and pgprot_t could never be
bigger than unsigned long. But this assumption will soon no longer be true
on arm64 when using D128 pgtables: in the 128-bit page table configuration,
unsigned long is 64 bit, but pgprot_t is 128 bit.
Passing the platform-abstracted pgprot_t argument is better than a
size-based data type. Let's change the parameter to directly pass pgprot_t,
like the similar helper generic_ioremap_prot().
Without this change in place, the D128 configuration does not work on arm64
as the top 64 bits get silently stripped when passing the protection value
to this function.
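In prototype form, the change described above amounts to the following
(parameter names assumed):
  /* Before: the protection value is squeezed through an unsigned long,
   * which truncates a 128-bit pgprot_t. */
  void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size,
                             unsigned long prot);

  /* After: pass the abstracted type directly, as generic_ioremap_prot()
   * already does. */
  void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size,
                             pgprot_t prot);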
Link: https://lkml.kernel.org/r/20250218101954.415331-1-anshuman.khandual@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Co-developed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This patch lays the groundwork for supporting batch PTE unmapping in
try_to_unmap_one(). It introduces range handling for TLB batch flushing,
with the range currently set to the size of PAGE_SIZE.
The function __flush_tlb_range_nosync() is architecture-specific and is
only used within arch/arm64. This function requires the mm structure
instead of the vma structure. To allow its reuse by
arch_tlbbatch_add_pending(), which operates with mm but not vma, this
patch modifies the argument of __flush_tlb_range_nosync() to take mm as
its parameter.
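In prototype form (the remaining parameters are assumed to match the
existing arm64 helper):
  /* Before: the mm is derived from the vma */
  static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
                                              unsigned long start, unsigned long end,
                                              unsigned long stride, bool last_level,
                                              int tlb_level);

  /* After: take the mm directly, so arch_tlbbatch_add_pending() (which has
   * an mm but no vma) can reuse it. */
  static inline void __flush_tlb_range_nosync(struct mm_struct *mm,
                                              unsigned long start, unsigned long end,
                                              unsigned long stride, bool last_level,
                                              int tlb_level);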
Link: https://lkml.kernel.org/r/20250214093015.51024-3-21cnbao@gmail.com
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Acked-by: Will Deacon <will@kernel.org>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shaoqin Huang <shahuang@redhat.com>
Cc: Gavin Shan <gshan@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mauricio Faria de Oliveira <mfo@canonical.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The shift applied for the size of a single 64-bit page table entry (at any
page table level) has always been hard coded as 3 (i.e. 2^3 = 8 bytes per
entry). Although intuitive, this is not very readable or easy to reason
about. Besides, it is going to change with D128, where each 128-bit page
table entry will shift address bytes by 4 (i.e. 2^4 = 16) instead.
Let's formalise this address-bytes shift value into a new macro called
PTDESC_ORDER, establishing a logical abstraction and thus improving
readability as well. While here, re-organize the EARLY_LEVEL macro along
with its dependents for better clarity. This does not cause any functional
change. Also replace all (PAGE_SHIFT - PTDESC_ORDER) instances with
PTDESC_TABLE_SHIFT.
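A sketch of the macros this introduces; the values follow directly from the
description above, while the exact header placement is an assumption:
  /* Each 64-bit descriptor occupies 2^3 = 8 bytes; with D128, 128-bit
   * descriptors would use 4 (2^4 = 16 bytes). */
  #define PTDESC_ORDER            3

  /* log2 of the number of entries held in one table page. */
  #define PTDESC_TABLE_SHIFT      (PAGE_SHIFT - PTDESC_ORDER)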
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: kasan-dev@googlegroups.com
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20250311045710.550625-1-anshuman.khandual@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
Test the given page table entries against PMD_TYPE_SECT using the
PMD_TYPE_MASK mask bits in order to identify block mappings in stage 1
page tables.
Cc: Will Deacon <will@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20250221044227.1145393-3-anshuman.khandual@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|