Age | Commit message (Collapse) | Author |
|
After
034ff37d3407 ("x86: rewrite '__copy_user_nocache' function")
rewrote __copy_user_nocache() to use EX_TYPE_UACCESS instead of the
EX_TYPE_COPY exception type, there are no more EX_TYPE_COPY users, so
remove it.
[ bp: Massage commit message. ]
Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240204082627.3892816-2-tongtiangen@huawei.com
|
|
Systems with a large number of CPUs may generate a large number of
machine check records when things go seriously wrong. But Linux has
a fixed-size buffer that can only capture a few dozen errors.
Allocate space based on the number of CPUs (with a minimum value based
on the historical fixed buffer that could store 80 records).
[ bp: Rename local var from tmpp to something more telling: gpool. ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Reviewed-by: Avadhut Naik <avadhut.naik@amd.com>
Link: https://lore.kernel.org/r/20240307192704.37213-1-tony.luck@intel.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS fixlet from Borislav Petkov:
- Constify yet another static struct bus_type instance now that the
driver core can handle that
* tag 'ras_core_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Make mce_subsys const
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 FRED support from Thomas Gleixner:
"Support for x86 Fast Return and Event Delivery (FRED).
FRED is a replacement for IDT event delivery on x86 and addresses most
of the technical nightmares which IDT exposes:
1) Exception cause registers like CR2 need to be manually preserved
in nested exception scenarios.
2) Hardware interrupt stack switching is suboptimal for nested
exceptions as the interrupt stack mechanism rewinds the stack on
each entry which requires a massive effort in the low level entry
of #NMI code to handle this.
3) No hardware distinction between entry from kernel or from user
which makes establishing kernel context more complex than it needs
to be especially for unconditionally nestable exceptions like NMI.
4) NMI nesting caused by IRET unconditionally reenabling NMIs, which
is a problem when the perf NMI takes a fault when collecting a
stack trace.
5) Partial restore of ESP when returning to a 16-bit segment
6) Limitation of the vector space which can cause vector exhaustion
on large systems.
7) Inability to differentiate NMI sources
FRED addresses these shortcomings by:
1) An extended exception stack frame which the CPU uses to save
exception cause registers. This ensures that the meta information
for each exception is preserved on stack and avoids the extra
complexity of preserving it in software.
2) Hardware interrupt stack switching is non-rewinding if a nested
exception uses the currently interrupt stack.
3) The entry points for kernel and user context are separate and GS
BASE handling which is required to establish kernel context for
per CPU variable access is done in hardware.
4) NMIs are now nesting protected. They are only reenabled on the
return from NMI.
5) FRED guarantees full restore of ESP
6) FRED does not put a limitation on the vector space by design
because it uses a central entry points for kernel and user space
and the CPUstores the entry type (exception, trap, interrupt,
syscall) on the entry stack along with the vector number. The
entry code has to demultiplex this information, but this removes
the vector space restriction.
The first hardware implementations will still have the current
restricted vector space because lifting this limitation requires
further changes to the local APIC.
7) FRED stores the vector number and meta information on stack which
allows having more than one NMI vector in future hardware when the
required local APIC changes are in place.
The series implements the initial FRED support by:
- Reworking the existing entry and IDT handling infrastructure to
accomodate for the alternative entry mechanism.
- Expanding the stack frame to accomodate for the extra 16 bytes FRED
requires to store context and meta information
- Providing FRED specific C entry points for events which have
information pushed to the extended stack frame, e.g. #PF and #DB.
- Providing FRED specific C entry points for #NMI and #MCE
- Implementing the FRED specific ASM entry points and the C code to
demultiplex the events
- Providing detection and initialization mechanisms and the necessary
tweaks in context switching, GS BASE handling etc.
The FRED integration aims for maximum code reuse vs the existing IDT
implementation to the extent possible and the deviation in hot paths
like context switching are handled with alternatives to minimalize the
impact. The low level entry and exit paths are seperate due to the
extended stack frame and the hardware based GS BASE swichting and
therefore have no impact on IDT based systems.
It has been extensively tested on existing systems and on the FRED
simulation and as of now there are no outstanding problems"
* tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
x86/fred: Fix init_task thread stack pointer initialization
MAINTAINERS: Add a maintainer entry for FRED
x86/fred: Fix a build warning with allmodconfig due to 'inline' failing to inline properly
x86/fred: Invoke FRED initialization code to enable FRED
x86/fred: Add FRED initialization functions
x86/syscall: Split IDT syscall setup code into idt_syscall_init()
KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling
x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI
x86/entry/calling: Allow PUSH_AND_CLEAR_REGS being used beyond actual entry code
x86/fred: Fixup fault on ERETU by jumping to fred_entrypoint_user
x86/fred: Let ret_from_fork_asm() jmp to asm_fred_exit_user when FRED is enabled
x86/traps: Add sysvec_install() to install a system interrupt handler
x86/fred: FRED entry/exit and dispatch code
x86/fred: Add a machine check entry stub for FRED
x86/fred: Add a NMI entry stub for FRED
x86/fred: Add a debug fault entry stub for FRED
x86/idtentry: Incorporate definitions/declarations of the FRED entries
x86/fred: Make exc_page_fault() work for FRED
x86/fred: Allow single-step trap and NMI when starting a new task
x86/fred: No ESPFIX needed when FRED is enabled
...
|
|
Now that __num_cores_per_package and __num_threads_per_package are
available, cpuinfo::x86_max_cores and the related math all over the place
can be replaced with the ready to consume data.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210253.176147806@linutronix.de
|
|
It's really a non-intuitive name. Rename it to __max_threads_per_core which
is obvious.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210253.011307973@linutronix.de
|
|
Switch it over to the new topology evaluation mechanism and remove the
random bits and pieces which are sprinkled all over the place.
No functional change intended.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153625.145745053@linutronix.de
|
|
AMD (ab)uses topology_die_id() to store the Node ID information and
topology_max_dies_per_pkg to store the number of nodes per package.
This collides with the proper processor die level enumeration which is
coming on AMD with CPUID 8000_0026, unless there is a correlation between
the two. There is zero documentation about that.
So provide new storage and new accessors which for now still access die_id
and topology_max_die_per_pkg(). Will be mopped up after AMD and HYGON are
converted over.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153624.956116738@linutronix.de
|
|
Now that the driver core can properly handle constant struct bus_type,
make mce_subsys a constant structure.
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20240204-bus_cleanup-x86-v1-1-4e7171be88e8@marliere.net
|
|
Like #DB, when occurred on different ring level, i.e., from user or kernel
context, #MCE needs to be handled on different stack: User #MCE on current
task stack, while kernel #MCE on a dedicated stack.
This is exactly how FRED event delivery invokes an exception handler: ring
3 event on level 0 stack, i.e., current task stack; ring 0 event on the
the FRED machine check entry stub doesn't do stack switch.
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-26-xin3.li@intel.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 TDX updates from Dave Hansen:
"This contains the initial support for host-side TDX support so that
KVM can run TDX-protected guests. This does not include the actual
KVM-side support which will come from the KVM folks. The TDX host
interactions with kexec also needs to be ironed out before this is
ready for prime time, so this code is currently Kconfig'd off when
kexec is on.
The majority of the code here is the kernel telling the TDX module
which memory to protect and handing some additional memory over to it
to use to store TDX module metadata. That sounds pretty simple, but
the TDX architecture is rather flexible and it takes quite a bit of
back-and-forth to say, "just protect all memory, please."
There is also some code tacked on near the end of the series to handle
a hardware erratum. The erratum can make software bugs such as a
kernel write to TDX-protected memory cause a machine check and
masquerade as a real hardware failure. The erratum handling watches
out for these and tries to provide nicer user errors"
* tag 'x86_tdx_for_6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
x86/virt/tdx: Make TDX host depend on X86_MCE
x86/virt/tdx: Disable TDX host support when kexec is enabled
Documentation/x86: Add documentation for TDX host support
x86/mce: Differentiate real hardware #MCs from TDX erratum ones
x86/cpu: Detect TDX partial write machine check erratum
x86/virt/tdx: Handle TDX interaction with sleep and hibernation
x86/virt/tdx: Initialize all TDMRs
x86/virt/tdx: Configure global KeyID on all packages
x86/virt/tdx: Configure TDX module with the TDMRs and global KeyID
x86/virt/tdx: Designate reserved areas for all TDMRs
x86/virt/tdx: Allocate and set up PAMTs for TDMRs
x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions
x86/virt/tdx: Get module global metadata for module initialization
x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
x86/virt/tdx: Add skeleton to enable TDX on demand
x86/virt/tdx: Add SEAMCALL error printing for module initialization
x86/virt/tdx: Handle SEAMCALL no entropy error in common code
x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC
x86/virt/tdx: Define TDX supported page sizes as macros
...
|
|
Add an Intel specific hook into machine_check_poll() to keep track of
per-CPU, per-bank corrected error logs (with a stub for the
CONFIG_MCE_INTEL=n case).
When a storm is observed the rate of interrupts is reduced by setting
a large threshold value for this bank in IA32_MCi_CTL2. This bank is
added to the bitmap of banks for this CPU to poll. The polling rate is
increased to once per second.
When a storm ends reset the threshold in IA32_MCi_CTL2 back to 1, remove
the bank from the bitmap for polling, and change the polling rate back
to the default.
If a CPU with banks in storm mode is taken offline, the new CPU that
inherits ownership of those banks takes over management of storm(s) in
the inherited bank(s).
The cmci_discover() function was already very large. These changes
pushed it well over the top. Refactor with three helper functions to
bring it back under control.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231115195450.12963-4-tony.luck@intel.com
|
|
This is the core functionality to track CMCI storms at the machine check
bank granularity. Subsequent patches will add the vendor specific hooks
to supply input to the storm detection and take actions on the start/end
of a storm.
machine_check_poll() is called both by the CMCI interrupt code, and for
periodic polls from a timer. Add a hook in this routine to maintain
a bitmap history for each bank showing whether the bank logged an
corrected error or not each time it is polled.
In normal operation the interval between polls of these banks determines
how far to shift the history. The 64 bit width corresponds to about one
second.
When a storm is observed a CPU vendor specific action is taken to reduce
or stop CMCI from the bank that is the source of the storm. The bank is
added to the bitmap of banks for this CPU to poll. The polling rate is
increased to once per second. During a storm each bit in the history
indicates the status of the bank each time it is polled. Thus the
history covers just over a minute.
Declare a storm for that bank if the number of corrected interrupts seen
in that history is above some threshold (defined as 5 in this series,
could be tuned later if there is data to suggest a better value).
A storm on a bank ends if enough consecutive polls of the bank show no
corrected errors (defined as 30, may also change). That calls the CPU
vendor specific function to revert to normal operational mode, and
changes the polling rate back to the default.
[ bp: Massage. ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231115195450.12963-3-tony.luck@intel.com
|
|
When a "storm" of corrected machine check interrupts (CMCI) is detected
this code mitigates by disabling CMCI interrupt signalling from all of
the banks owned by the CPU that saw the storm.
There are problems with this approach:
1) It is very coarse grained. In all likelihood only one of the banks
was generating the interrupts, but CMCI is disabled for all. This
means Linux may delay seeing and processing errors logged from other
banks.
2) Although CMCI stands for Corrected Machine Check Interrupt, it is
also used to signal when an uncorrected error is logged. This is
a problem because these errors should be handled in a timely manner.
Delete all this code in preparation for a finer grained solution.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Tested-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20231115195450.12963-2-tony.luck@intel.com
|
|
The first few generations of TDX hardware have an erratum. Triggering
it in Linux requires some kind of kernel bug involving relatively exotic
memory writes to TDX private memory and will manifest via
spurious-looking machine checks when reading the affected memory.
Make an effort to detect these TDX-induced machine checks and spit out
a new blurb to dmesg so folks do not think their hardware is failing.
== Background ==
Virtually all kernel memory accesses operations happen in full
cachelines. In practice, writing a "byte" of memory usually reads a 64
byte cacheline of memory, modifies it, then writes the whole line back.
Those operations do not trigger this problem.
This problem is triggered by "partial" writes where a write transaction
of less than cacheline lands at the memory controller. The CPU does
these via non-temporal write instructions (like MOVNTI), or through
UC/WC memory mappings. The issue can also be triggered away from the
CPU by devices doing partial writes via DMA.
== Problem ==
A partial write to a TDX private memory cacheline will silently "poison"
the line. Subsequent reads will consume the poison and generate a
machine check. According to the TDX hardware spec, neither of these
things should have happened.
To add insult to injury, the Linux machine code will present these as a
literal "Hardware error" when they were, in fact, a software-triggered
issue.
== Solution ==
In the end, this issue is hard to trigger. Rather than do something
rash (and incomplete) like unmap TDX private memory from the direct map,
improve the machine check handler.
Currently, the #MC handler doesn't distinguish whether the memory is
TDX private memory or not but just dump, for instance, below message:
[...] mce: [Hardware Error]: CPU 147: Machine Check Exception: f Bank 1: bd80000000100134
[...] mce: [Hardware Error]: RIP 10:<ffffffffadb69870> {__tlb_remove_page_size+0x10/0xa0}
...
[...] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
[...] Kernel panic - not syncing: Fatal local machine check
Which says "Hardware Error" and "Data load in unrecoverable area of
kernel".
Ideally, it's better for the log to say "software bug around TDX private
memory" instead of "Hardware Error". But in reality the real hardware
memory error can happen, and sadly such software-triggered #MC cannot be
distinguished from the real hardware error. Also, the error message is
used by userspace tool 'mcelog' to parse, so changing the output may
break userspace.
So keep the "Hardware Error". The "Data load in unrecoverable area of
kernel" is also helpful, so keep it too.
Instead of modifying above error log, improve the error log by printing
additional TDX related message to make the log like:
...
[...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
[...] mce: [Hardware Error]: Machine Check: TDX private memory error. Possible kernel bug.
Adding this additional message requires determination of whether the
memory page is TDX private memory. There is no existing infrastructure
to do that. Add an interface to query the TDX module to fill this gap.
== Impact ==
This issue requires some kind of kernel bug to trigger.
TDX private memory should never be mapped UC/WC. A partial write
originating from these mappings would require *two* bugs, first mapping
the wrong page, then writing the wrong memory. It would also be
detectable using traditional memory corruption techniques like
DEBUG_PAGEALLOC.
MOVNTI (and friends) could cause this issue with something like a simple
buffer overrun or use-after-free on the direct map. It should also be
detectable with normal debug techniques.
The one place where this might get nasty would be if the CPU read data
then wrote back the same data. That would trigger this problem but
would not, for instance, set off mechanisms like slab redzoning because
it doesn't actually corrupt data.
With an IOMMU at least, the DMA exposure is similar to the UC/WC issue.
TDX private memory would first need to be incorrectly mapped into the
I/O space and then a later DMA to that mapping would actually cause the
poisoning event.
[ dhansen: changelog tweaks ]
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-18-dave.hansen%40intel.com
|
|
Add HWID and McaType values for new SMCA bank types.
Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231102114225.2006878-3-muralimk@amd.com
|
|
The long names of the SMCA banks are only used by the MCE decoder
module.
Move them out of the arch code and into the decoder module.
[ bp: Name the long names array "smca_long_names", drop local ptr in
decode_smca_error(), constify arrays. ]
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231118193248.1296798-5-yazen.ghannam@amd.com
|
|
AMD systems generally allow MCA "simulation" where MCA registers can be
written with valid data and the full MCA handling flow can be tested by
software.
However, the platform on Scalable MCA systems, can prevent software from
writing data to the MCA registers. There is no architectural way to
determine this configuration. Therefore, the MCE injection module will
check for this behavior by writing and reading back a test status value.
This is done during module init, and the check can run on any CPU with
any valid MCA bank.
If MCA_STATUS writes are ignored by the platform, then there are no side
effects on the hardware state.
If the writes are not ignored, then the test status value will remain in
the hardware MCA_STATUS register. It is likely that the value will not
be overwritten by hardware or software, since the tested CPU and bank
are arbitrary. Therefore, the user may see a spurious, synthetic MCA
error reported whenever MCA is polled for this CPU.
Clear the test value immediately after writing it. It is very unlikely
that a valid MCA error is logged by hardware during the test. Errors
that cause an #MC won't be affected.
Fixes: 891e465a1bd8 ("x86/mce: Check whether writes to MCA_STATUS are getting ignored")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231118193248.1296798-2-yazen.ghannam@amd.com
|
|
mce_device_create() is called only from mce_cpu_online() which in turn
will be called iff MCA support is available. That is, at the time of
mce_device_create() call it's guaranteed that MCA support is available.
No need to duplicate this check so remove it.
[ bp: Massage commit message. ]
Signed-off-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231107165529.407349-1-nik.borisov@suse.com
|
|
Memory errors don't happen very often, especially fatal ones. However,
in large-scale scenarios such as data centers, that probability
increases with the amount of machines present.
When a fatal machine check happens, mce_panic() is called based on the
severity grading of that error. The page containing the error is not
marked as poison.
However, when kexec is enabled, tools like makedumpfile understand when
pages are marked as poison and do not touch them so as not to cause
a fatal machine check exception again while dumping the previous
kernel's memory.
Therefore, mark the page containing the error as poisoned so that the
kexec'ed kernel can avoid accessing the page.
[ bp: Rewrite commit message and comment. ]
Co-developed-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Zhiquan Li <zhiquan1.li@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Link: https://lore.kernel.org/r/20231014051754.3759099-1-zhiquan1.li@intel.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 core updates from Thomas Gleixner:
- Limit the hardcoded topology quirk for Hygon CPUs to those which have
a model ID less than 4.
The newer models have the topology CPUID leaf 0xB correctly
implemented and are not affected.
- Make SMT control more robust against enumeration failures
SMT control was added to allow controlling SMT at boottime or
runtime. The primary purpose was to provide a simple mechanism to
disable SMT in the light of speculation attack vectors.
It turned out that the code is sensible to enumeration failures and
worked only by chance for XEN/PV. XEN/PV has no real APIC enumeration
which means the primary thread mask is not set up correctly. By
chance a XEN/PV boot ends up with smp_num_siblings == 2, which makes
the hotplug control stay at its default value "enabled". So the mask
is never evaluated.
The ongoing rework of the topology evaluation caused XEN/PV to end up
with smp_num_siblings == 1, which sets the SMT control to "not
supported" and the empty primary thread mask causes the hotplug core
to deny the bringup of the APS.
Make the decision logic more robust and take 'not supported' and 'not
implemented' into account for the decision whether a CPU should be
booted or not.
- Fake primary thread mask for XEN/PV
Pretend that all XEN/PV vCPUs are primary threads, which makes the
usage of the primary thread mask valid on XEN/PV. That is consistent
with because all of the topology information on XEN/PV is fake or
even non-existent.
- Encapsulate topology information in cpuinfo_x86
Move the randomly scattered topology data into a separate data
structure for readability and as a preparatory step for the topology
evaluation overhaul.
- Consolidate APIC ID data type to u32
It's fixed width hardware data and not randomly u16, int, unsigned
long or whatever developers decided to use.
- Cure the abuse of cpuinfo for persisting logical IDs.
Per CPU cpuinfo is used to persist the logical package and die IDs.
That's really not the right place simply because cpuinfo is subject
to be reinitialized when a CPU goes through an offline/online cycle.
Use separate per CPU data for the persisting to enable the further
topology management rework. It will be removed once the new topology
management is in place.
- Provide a debug interface for inspecting topology information
Useful in general and extremly helpful for validating the topology
management rework in terms of correctness or "bug" compatibility.
* tag 'x86-core-2023-10-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/apic, x86/hyperv: Use u32 in hv_snp_boot_ap() too
x86/cpu: Provide debug interface
x86/cpu/topology: Cure the abuse of cpuinfo for persisting logical ids
x86/apic: Use u32 for wakeup_secondary_cpu[_64]()
x86/apic: Use u32 for [gs]et_apic_id()
x86/apic: Use u32 for phys_pkg_id()
x86/apic: Use u32 for cpu_present_to_apicid()
x86/apic: Use u32 for check_apicid_used()
x86/apic: Use u32 for APIC IDs in global data
x86/apic: Use BAD_APICID consistently
x86/cpu: Move cpu_l[l2]c_id into topology info
x86/cpu: Move logical package and die IDs into topology info
x86/cpu: Remove pointless evaluation of x86_coreid_bits
x86/cpu: Move cu_id into topology info
x86/cpu: Move cpu_core_id into topology info
hwmon: (fam15h_power) Use topology_core_id()
scsi: lpfc: Use topology_core_id()
x86/cpu: Move cpu_die_id into topology info
x86/cpu: Move phys_proc_id into topology info
x86/cpu: Encapsulate topology information in cpuinfo_x86
...
|
|
Move Intel-specific checks into a helper function.
Explicitly use "bool" for return type.
No functional change intended.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230613141142.36801-4-yazen.ghannam@amd.com
|
|
Currently, all valid MCA_ADDR values are assumed to be usable on AMD
systems. However, this is not correct in most cases. Notifiers expecting
usable addresses may then operate on inappropriate values.
Define a helper function to do AMD-specific checks for a usable memory
address. List out all known cases.
[ bp: Tone down the capitalized words. ]
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230613141142.36801-3-yazen.ghannam@amd.com
|
|
Define helper functions for legacy and SMCA systems in order to reuse
individual checks in later changes.
Describe what each function is checking for, and correct the XEC bitmask
for SMCA.
No functional change intended.
[ bp: Use "else in amd_mce_is_memory_error() to make the conditional
balanced, for readability. ]
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230613141142.36801-2-yazen.ghannam@amd.com
|
|
Rename it to pkg_id which is the terminology used in the kernel.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.329006989@linutronix.de
|
|
The topology related information is randomly scattered across cpuinfo_x86.
Create a new structure cpuinfo_topo and move in a first step initial_apicid
and apicid into it.
Aside of being better readable this is in preparation for replacing the
horribly fragile CPU topology evaluation code further down the road.
Consolidate APIC ID fields to u32 as that represents the hardware type.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.269787744@linutronix.de
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 apic updates from Dave Hansen:
"This includes a very thorough rework of the 'struct apic' handlers.
Quite a variety of them popped up over the years, especially in the
32-bit days when odd apics were much more in vogue.
The end result speaks for itself, which is a removal of a ton of code
and static calls to replace indirect calls.
If there's any breakage here, it's likely to be around the 32-bit
museum pieces that get light to no testing these days.
Summary:
- Rework apic callbacks, getting rid of unnecessary ones and
coalescing lots of silly duplicates.
- Use static_calls() instead of indirect calls for apic->foo()
- Tons of cleanups an crap removal along the way"
* tag 'x86_apic_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
x86/apic: Turn on static calls
x86/apic: Provide static call infrastructure for APIC callbacks
x86/apic: Wrap IPI calls into helper functions
x86/apic: Mark all hotpath APIC callback wrappers __always_inline
x86/xen/apic: Mark apic __ro_after_init
x86/apic: Convert other overrides to apic_update_callback()
x86/apic: Replace acpi_wake_cpu_handler_update() and apic_set_eoi_cb()
x86/apic: Provide apic_update_callback()
x86/xen/apic: Use standard apic driver mechanism for Xen PV
x86/apic: Provide common init infrastructure
x86/apic: Wrap apic->native_eoi() into a helper
x86/apic: Nuke ack_APIC_irq()
x86/apic: Remove pointless arguments from [native_]eoi_write()
x86/apic/noop: Tidy up the code
x86/apic: Remove pointless NULL initializations
x86/apic: Sanitize APIC ID range validation
x86/apic: Prepare x2APIC for using apic::max_apic_id
x86/apic: Simplify X2APIC ID validation
x86/apic: Add max_apic_id member
x86/apic: Wrap APIC ID validation into an inline
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RAS updates from Borislav Petkov:
- Add a quirk for AMD Zen machines where Instruction Fetch unit poison
consumption MCEs are not delivered synchronously but still within the
same context, which can lead to erroneously increased error severity
and unneeded kernel panics
- Do not log errors caught by polling shared MCA banks as they
materialize as duplicated error records otherwise
* tag 'ras_core_for_v6.6_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/MCE: Always save CS register on AMD Zen IF Poison errors
x86/mce: Prevent duplicate error records
|
|
The Instruction Fetch (IF) units on current AMD Zen-based systems do not
guarantee a synchronous #MC is delivered for poison consumption errors.
Therefore, MCG_STATUS[EIPV|RIPV] will not be set. However, the
microarchitecture does guarantee that the exception is delivered within
the same context. In other words, the exact rIP is not known, but the
context is known to not have changed.
There is no architecturally-defined method to determine this behavior.
The Code Segment (CS) register is always valid on such IF unit poison
errors regardless of the value of MCG_STATUS[EIPV|RIPV].
Add a quirk to save the CS register for poison consumption from the IF
unit banks.
This is needed to properly determine the context of the error.
Otherwise, the severity grading function will assume the context is
IN_KERNEL due to the m->cs value being 0 (the initialized value). This
leads to unnecessary kernel panics on data poison errors due to the
kernel believing the poison consumption occurred in kernel context.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230814200853.29258-1-yazen.ghannam@amd.com
|
|
Move them to one place so the static call conversion gets simpler.
No functional change.
[ dhansen: merge against recent x86/apic changes ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
|
|
Yet another wrapper of a wrapper gone along with the outdated comment
that this compiles to a single instruction.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
|
|
AMD systems from Family 10h to 16h share MCA bank 4 across multiple CPUs.
Therefore, the threshold_bank structure for bank 4, and its threshold_block
structures, will be initialized once at boot time. And the kobject for the
shared bank will be added to each of the CPUs that share it. Furthermore,
the threshold_blocks for the shared bank will be added again to the bank's
kobject. These additions will increase the refcount for the bank's kobject.
For example, a shared bank with two blocks and shared across two CPUs will
be set up like this:
CPU0 init
bank create and add; bank refcount = 1; threshold_create_bank()
block 0 init and add; bank refcount = 2; allocate_threshold_blocks()
block 1 init and add; bank refcount = 3; allocate_threshold_blocks()
CPU1 init
bank add; bank refcount = 3; threshold_create_bank()
block 0 add; bank refcount = 4; __threshold_add_blocks()
block 1 add; bank refcount = 5; __threshold_add_blocks()
Currently in threshold_remove_bank(), if the bank is shared then
__threshold_remove_blocks() is called. Here the shared bank's kobject and
the bank's blocks' kobjects are deleted. This is done on the first call
even while the structures are still shared. Subsequent calls from other
CPUs that share the structures will attempt to delete the kobjects.
During kobject_del(), kobject->sd is removed. If the kobject is not part of
a kset with default_groups, then subsequent kobject_del() calls seem safe
even with kobject->sd == NULL.
Originally, the AMD MCA thresholding structures did not use default_groups.
And so the above behavior was not apparent.
However, a recent change implemented default_groups for the thresholding
structures. Therefore, kobject_del() will go down the sysfs_remove_groups()
code path. In this case, the first kobject_del() may succeed and remove
kobject->sd. But subsequent kobject_del() calls will give a WARNing in
kernfs_remove_by_name_ns() since kobject->sd == NULL.
Use kobject_put() on the shared bank's kobject when "removing" blocks. This
decrements the bank's refcount while keeping kobjects enabled until the
bank is no longer shared. At that point, kobject_put() will be called on
the blocks which drives their refcount to 0 and deletes them and also
decrementing the bank's refcount. And finally kobject_put() will be called
on the bank driving its refcount to 0 and deleting it.
The same example above:
CPU1 shutdown
bank is shared; bank refcount = 5; threshold_remove_bank()
block 0 put parent bank; bank refcount = 4; __threshold_remove_blocks()
block 1 put parent bank; bank refcount = 3; __threshold_remove_blocks()
CPU0 shutdown
bank is no longer shared; bank refcount = 3; threshold_remove_bank()
block 0 put block; bank refcount = 2; deallocate_threshold_blocks()
block 1 put block; bank refcount = 1; deallocate_threshold_blocks()
put bank; bank refcount = 0; threshold_remove_bank()
Fixes: 7f99cb5e6039 ("x86/CPU/AMD: Use default_groups in kobj_type")
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/alpine.LRH.2.02.2205301145540.25840@file01.intranet.prod.int.rdu2.redhat.com
|
|
A legitimate use case of the MCA infrastructure is to have the firmware
log all uncorrectable errors and also, have the OS see all correctable
errors.
The uncorrectable, UCNA errors are usually configured to be reported
through an SMI. CMCI, which is the correctable error reporting
interrupt, uses SMI too and having both enabled, leads to unnecessary
overhead.
So what ends up happening is, people disable CMCI in the wild and leave
on only the UCNA SMI.
When CMCI is disabled, the MCA infrastructure resorts to polling the MCA
banks. If a MCA MSR is shared between the logical threads, one error
ends up getting logged multiple times as the polling runs on every
logical thread.
Therefore, introduce locking on the Intel side of the polling routine to
prevent such duplicate error records from appearing.
Based on a patch by Aristeu Rozanski <aris@ruivo.org>.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Acked-by: Aristeu Rozanski <aris@ruivo.org>
Link: https://lore.kernel.org/r/20230515143225.GC4090740@cathedrallabs.org
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
- Introduce cmpxchg128() -- aka. the demise of cmpxchg_double()
The cmpxchg128() family of functions is basically & functionally the
same as cmpxchg_double(), but with a saner interface.
Instead of a 6-parameter horror that forced u128 - u64/u64-halves
layout details on the interface and exposed users to complexity,
fragility & bugs, use a natural 3-parameter interface with u128
types.
- Restructure the generated atomic headers, and add kerneldoc comments
for all of the generic atomic{,64,_long}_t operations.
The generated definitions are much cleaner now, and come with
documentation.
- Implement lock_set_cmp_fn() on lockdep, for defining an ordering when
taking multiple locks of the same type.
This gets rid of one use of lockdep_set_novalidate_class() in the
bcache code.
- Fix raw_cpu_generic_try_cmpxchg() bug due to an unintended variable
shadowing generating garbage code on Clang on certain ARM builds.
* tag 'locking-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
locking/atomic: scripts: fix ${atomic}_dec_if_positive() kerneldoc
percpu: Fix self-assignment of __old in raw_cpu_generic_try_cmpxchg()
locking/atomic: treewide: delete arch_atomic_*() kerneldoc
locking/atomic: docs: Add atomic operations to the driver basic API documentation
locking/atomic: scripts: generate kerneldoc comments
docs: scripts: kernel-doc: accept bitwise negation like ~@var
locking/atomic: scripts: simplify raw_atomic*() definitions
locking/atomic: scripts: simplify raw_atomic_long*() definitions
locking/atomic: scripts: split pfx/name/sfx/order
locking/atomic: scripts: restructure fallback ifdeffery
locking/atomic: scripts: build raw_atomic_long*() directly
locking/atomic: treewide: use raw_atomic*_<op>()
locking/atomic: scripts: add trivial raw_atomic*_<op>()
locking/atomic: scripts: factor out order template generation
locking/atomic: scripts: remove leftover "${mult}"
locking/atomic: scripts: remove bogus order parameter
locking/atomic: xtensa: add preprocessor symbols
locking/atomic: x86: add preprocessor symbols
locking/atomic: sparc: add preprocessor symbols
locking/atomic: sh: add preprocessor symbols
...
|
|
The MI200 (Aldebaran) series of devices introduced a new SMCA bank type
for Unified Memory Controllers. The MCE subsystem already has support
for this new type. The MCE decoder module will decode the common MCA
error information for the new bank type, but it will not pass the
information to the AMD64 EDAC module for detailed memory error decoding.
Have the MCE decoder module recognize the new bank type as an SMCA UMC
memory error and pass the MCA information to AMD64 EDAC.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Co-developed-by: Muralidhara M K <muralidhara.mk@amd.com>
Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230515113537.1052146-3-muralimk@amd.com
|
|
Now that we have raw_atomic*_<op>() definitions, there's no need to use
arch_atomic*_<op>() definitions outside of the low-level atomic
definitions.
Move treewide users of arch_atomic*_<op>() over to the equivalent
raw_atomic*_<op>().
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230605070124.3741859-19-mark.rutland@arm.com
|
|
Make sure that machine check errors with a usable address are properly
marked as poison.
This is needed for errors that occur on memory which have
MCG_STATUS[RIPV] clear - i.e., the interrupted process cannot be
restarted reliably. One example is data poison consumption through the
instruction fetch units on AMD Zen-based systems.
The MF_MUST_KILL flag is passed to memory_failure() when
MCG_STATUS[RIPV] is not set. So the associated process will still be
killed. What this does, practically, is get rid of one more check to
kill_current_task with the eventual goal to remove it completely.
Also, make the handling identical to what is done on the notifier path
(uc_decode_notifier() does that address usability check too).
The scenario described above occurs when hardware can precisely identify
the address of poisoned memory, but execution cannot reliably continue
for the interrupted hardware thread.
[ bp: Massage commit message. ]
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20230322005131.174499-1-tony.luck@intel.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS updates from Borislav Petkov:
- Just cleanups and fixes this time around: make threshold_ktype const,
an objtool fix and use proper size for a bitmap
* tag 'ras_core_for_v6.4_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/MCE/AMD: Use an u64 for bank_map
x86/mce: Always inline old MCA stubs
x86/MCE/AMD: Make kobj_type structure constant
|
|
Thee maximum number of MCA banks is 64 (MAX_NR_BANKS), see
a0bc32b3cacf ("x86/mce: Increase maximum number of banks to 64").
However, the bank_map which contains a bitfield of which banks to
initialize is of type unsigned int and that overflows when those bit
numbers are >= 32, leading to UBSAN complaining correctly:
UBSAN: shift-out-of-bounds in arch/x86/kernel/cpu/mce/amd.c:1365:38
shift exponent 32 is too large for 32-bit type 'int'
Change the bank_map to a u64 and use the proper BIT_ULL() macro when
modifying bits in there.
[ bp: Rewrite commit message. ]
Fixes: a0bc32b3cacf ("x86/mce: Increase maximum number of banks to 64")
Signed-off-by: Muralidhara M K <muralimk@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230127151601.1068324-1-muralimk@amd.com
|
|
A recent change introduced a flag to queue up errors found during
boot-time polling. These errors will be processed during late init once
the MCE subsystem is fully set up.
A number of sysfs updates call mce_restart() which goes through a subset
of the CPU init flow. This includes polling MCA banks and logging any
errors found. Since the same function is used as boot-time polling,
errors will be queued. However, the system is now past late init, so the
errors will remain queued until another error is found and the workqueue
is triggered.
Call mce_schedule_work() at the end of mce_restart() so that queued
errors are processed.
Fixes: 3bff147b187d ("x86/mce: Defer processing of early errors")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230301221420.2203184-1-yazen.ghannam@amd.com
|
|
The stubs for the ancient MCA support (CONFIG_X86_ANCIENT_MCE) are
normally optimized away on 64-bit builds. However, an allmodconfig one
causes the compiler to add sanitizer calls gunk into them and they exist
as constprop calls. Which objtool then complains about:
vmlinux.o: warning: objtool: do_machine_check+0xad8: call to \
pentium_machine_check.constprop.0() leaves .noinstr.text section
due to them missing noinstr. One could tag them "noinstr" but what
should really happen is, they should be forcefully inlined so that all
that gunk gets optimized away and the warning doesn't even have a chance
to fire.
Do so.
No functional changes.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230222191054.4701-1-bp@alien8.de
|
|
Since
ee6d3dd4ed48 ("driver core: make kobj_type constant.")
the driver core allows the usage of const struct kobj_type.
Take advantage of this to constify the structure definition to prevent
modification at runtime.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230217-kobj_type-mce-amd-v1-1-40ef94816444@weissschuh.net
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS updates from Borislav Petkov:
- Add support for reporting more bits of the physical address on error,
on newer AMD CPUs
- Mask out bits which don't belong to the address of the error being
reported
* tag 'ras_core_for_v6.3_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Mask out non-address bits from machine check bank
x86/mce: Add support for Extended Physical Address MCA changes
x86/mce: Define a function to extract ErrorAddr from MCA_ADDR
|
|
Systems that support various memory encryption schemes (MKTME, TDX, SEV)
use high order physical address bits to indicate which key should be
used for a specific memory location.
When a memory error is reported, some systems may report those key
bits in the IA32_MCi_ADDR machine check MSR.
The Intel SDM has a footnote for the contents of the address register
that says: "Useful bits in this field depend on the address methodology
in use when the register state is saved."
AMD Processor Programming Reference has a more explicit description
of the MCA_ADDR register:
"For physical addresses, the most significant bit is given by
Core::X86::Cpuid::LongModeInfo[PhysAddrSize]."
Add a new #define MCI_ADDR_PHYSADDR for the mask of valid physical
address bits within the machine check bank address register. Use this
mask for recoverable machine check handling and in the EDAC driver to
ignore any key bits that may be present.
[ Tony: Based on independent fixes proposed by Fan Du and Isaku Yamahata ]
Reported-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reported-by: Fan Du <fan.du@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20230109152936.397862-1-tony.luck@intel.com
|
|
The implementation of strscpy() is more robust and safer.
That's now the recommended way to copy NUL terminated strings.
Signed-off-by: Xu Panda <xu.panda@zte.com.cn>
Signed-off-by: Yang Yang <yang.yang29@zte.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/202212031419324523731@zte.com.cn
|
|
Newer AMD CPUs support more physical address bits.
That is, the MCA_ADDR registers on Scalable MCA systems contain the
ErrorAddr in bits [56:0] instead of [55:0]. Hence, the existing LSB field
from bits [61:56] in MCA_ADDR must be moved around to accommodate the
larger ErrorAddr size.
MCA_CONFIG[McaLsbInStatusSupported] indicates this change. If set, the
LSB field will be found in MCA_STATUS rather than MCA_ADDR.
Each logical CPU has unique MCA bank in hardware and is not shared with
other logical CPUs. Additionally, on SMCA systems, each feature bit may
be different for each bank within same logical CPU.
Check for MCA_CONFIG[McaLsbInStatusSupported] for each MCA bank and for
each CPU.
Additionally, all MCA banks do not support maximum ErrorAddr bits in
MCA_ADDR. Some banks might support fewer bits but the remaining bits are
marked as reserved.
[ Yazen: Rebased and fixed up formatting.
bp: Massage comments. ]
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20221206173607.1185907-5-yazen.ghannam@amd.com
|
|
Move MCA_ADDR[ErrorAddr] extraction into a separate helper function. This
will be further refactored to support extended ErrorAddr bits in MCA_ADDR
in newer AMD CPUs.
[ bp: Massage. ]
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/all/20220225193342.215780-3-Smita.KoralahalliChannabasappa@amd.com/
|
|
mce_severity_intel() has a special case to promote UC and AR errors
in kernel context to PANIC severity.
The "AR" case is already handled with separate entries in the severity
table for all instruction fetch errors, and those data fetch errors that
are not in a recoverable area of the kernel (i.e. have an extable fixup
entry).
Add an entry to the severity table for UC errors in kernel context that
reports severity = PANIC. Delete the special case code from
mce_severity_intel().
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220922195136.54575-2-tony.luck@intel.com
|
|
AMD's MCA Thresholding feature counts errors of all severity levels, not
just correctable errors. If a deferred error causes the threshold limit
to be reached (it was the error that caused the overflow), then both a
deferred error interrupt and a thresholding interrupt will be triggered.
The order of the interrupts is not guaranteed. If the threshold
interrupt handler is executed first, then it will clear MCA_STATUS for
the error. It will not check or clear MCA_DESTAT which also holds a copy
of the deferred error. When the deferred error interrupt handler runs it
will not find an error in MCA_STATUS, but it will find the error in
MCA_DESTAT. This will cause two errors to be logged.
Check for deferred errors when handling a threshold interrupt. If a bank
contains a deferred error, then clear the bank's MCA_DESTAT register.
Define a new helper function to do the deferred error check and clearing
of MCA_DESTAT.
[ bp: Simplify, convert comment to passive voice. ]
Fixes: 37d43acfd79f ("x86/mce/AMD: Redo error logging from APIC LVT interrupt handlers")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220621155943.33623-1-yazen.ghannam@amd.com
|
|
When memory poison consumption machine checks fire, MCE notifier
handlers like nfit_handle_mce() record the impacted physical address
range which is reported by the hardware in the MCi_MISC MSR. The error
information includes data about blast radius, i.e. how many cachelines
did the hardware determine are impacted. A recent change
7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison granularity")
updated nfit_handle_mce() to stop hard coding the blast radius value of
1 cacheline, and instead rely on the blast radius reported in 'struct
mce' which can be up to 4K (64 cachelines).
It turns out that apei_mce_report_mem_error() had a similar problem in
that it hard coded a blast radius of 4K rather than reading the blast
radius from the error information. Fix apei_mce_report_mem_error() to
convey the proper poison granularity.
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8502@oracle.com
Link: https://lore.kernel.org/r/20220826233851.1319100-1-jane.chu@oracle.com
|