summaryrefslogtreecommitdiff
path: root/drivers/edac/i10nm_base.c
AgeCommit message (Collapse)Author
2025-07-15EDAC/{skx_common,i10nm}: Use scnprintf() for safer buffer handlingWang Haoran
snprintf() is fragile when its return value will be used to append additional data to a buffer. Use scnprintf() instead. Signed-off-by: Wang Haoran <haoranwangsec@gmail.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Link: https://lore.kernel.org/r/20250715131700.1092720-1-haoranwangsec@gmail.com
2025-07-07EDAC/i10nm: Add Intel Granite Rapids-D supportQiuxu Zhuo
The Granite Rapids-D CPU model uses memory controller registers similar to those of the Granite Rapids server CPU but with a different memory controller MMIO base. Add the Granite Rapids-D CPU model ID and use the new memory controller MMIO base for EDAC support. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: VikasX Chougule <vikasx.chougule@intel.com> Link: https://lore.kernel.org/r/20250704151609.7833-2-qiuxu.zhuo@intel.com
2025-04-24EDAC/i10nm: Fix the bitwise operation between variables of different sizesQiuxu Zhuo
The tool of Smatch static checker reported the following warning: drivers/edac/i10nm_base.c:364 show_retry_rd_err_log() warn: should bitwise negate be 'ullong'? This warning was due to the bitwise NOT/AND operations between 'status_mask' (a u32 type) and 'log' (a u64 type), which resulted in the high 32 bits of 'log' were cleared. This was a false positive warning, as only the low 32 bits of 'log' was written to the first RRL memory controller register (a u32 type). To improve code sanity, fix this warning by changing 'status_mask' to a u64 type, ensuring it matches the size of 'log' for bitwise operations. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/all/aAih0KmEVq7ch6v2@stanley.mountain/ Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20250424081454.2952632-1-qiuxu.zhuo@intel.com
2025-04-17EDAC/{skx_common,i10nm}: Add RRL support for Intel Granite Rapids serverQiuxu Zhuo
Compared to previous generations, Granite Rapids defines the RRL control bits {en_patspr, noover, en} in different positions, adds an extra RRL set for the new mode of the first patrol-scrub read error, and extends the number of CORRERRCNT registers from 4 to 8, encoding one counter per CORRERRCNT register. Add a Granite Rapids reg_rrl configuration table and adjust the code to accommodate the differences mentioned above for RRL support. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Feng Xu <feng.f.xu@intel.com> Link: https://lore.kernel.org/r/20250417150724.1170168-8-qiuxu.zhuo@intel.com
2025-04-17EDAC/{skx_common,i10nm}: Refactor show_retry_rd_err_log()Qiuxu Zhuo
Make the {valid bit, overwritten status, number} of RRL registers and the {number, offsets, widths} of per-channel CORRERRCNT registers configurable. Refactor show_retry_rd_err_log() to use the configurable fields of struct reg_rrl, making the code more scalable and simpler. No functional changes intended. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Feng Xu <feng.f.xu@intel.com> Link: https://lore.kernel.org/r/20250417150724.1170168-7-qiuxu.zhuo@intel.com
2025-04-17EDAC/{skx_common,i10nm}: Refactor enable_retry_rd_err_log()Qiuxu Zhuo
Refactor enable_retry_rd_err_log() using helper functions for both DDR and HBM, making the RRL control bits configurable instead of hard-coded. Additionally, explicitly define the four RRL modes for better readability. No functional changes intended. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Feng Xu <feng.f.xu@intel.com> Link: https://lore.kernel.org/r/20250417150724.1170168-6-qiuxu.zhuo@intel.com
2025-04-17EDAC/{skx_common,i10nm}: Structure the per-channel RRL registersQiuxu Zhuo
As the number of RRL (retry_rd_err_log) registers per memory channel increases, the positions of the RRL control bits and the widths of the RRL registers vary across different CPU generations. Adding RRL support for a new CPU requires handling these differences throughout the RRL-related code. Structure the offsets, widths, control bit positions, set numbers, modes, etc., of the per-channel RRL registers and make them configurable to facilitate easier RRL support for new CPUs. No functional changes are intended. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Feng Xu <feng.f.xu@intel.com> Link: https://lore.kernel.org/r/20250417150724.1170168-5-qiuxu.zhuo@intel.com
2025-04-17EDAC/i10nm: Explicitly set the modes of the RRL register setsQiuxu Zhuo
The i10nm_edac driver uses the default modes (either patrol scrub read or on-demand read) of the RRL register sets configured by the BIOS. Explicitly set the modes during the loading of the i10nm_edac driver with the module parameter retry_rd_err_log=2. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Feng Xu <feng.f.xu@intel.com> Link: https://lore.kernel.org/r/20250417150724.1170168-4-qiuxu.zhuo@intel.com
2025-04-17EDAC/{skx_common,i10nm}: Fix the loss of saved RRL for HBM pseudo channel 0Qiuxu Zhuo
When enabling the retry_rd_err_log (RRL) feature during the loading of the i10nm_edac driver with the module parameter retry_rd_err_log=2 (Linux RRL control mode), the default values of the control bits of RRL are saved so that they can be restored during the unloading of the driver. In the current code, the RRL of pseudo channel 1 of HBM overwrites pseudo channel 0 during the loading of the driver, resulting in the loss of saved RRL for pseudo channel 0. This causes the RRL of pseudo channel 0 of HBM to be wrongly restored with the values from pseudo channel 1 when unloading the driver. Fix this issue by creating two separate groups of RRL control registers per channel to save default RRL settings of two {sub-,pseudo-}channels. Fixes: acd4cf68fefe ("EDAC/i10nm: Retrieve and print retry_rd_err_log registers for HBM") Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Feng Xu <feng.f.xu@intel.com> Link: https://lore.kernel.org/r/20250417150724.1170168-3-qiuxu.zhuo@intel.com
2025-02-20EDAC/{skx_common,i10nm}: Fix some missing error reports on Emerald RapidsQiuxu Zhuo
When doing error injection to some memory DIMMs on certain Intel Emerald Rapids servers, the i10nm_edac missed error reports for some memory DIMMs. Certain BIOS configurations may hide some memory controllers, and the i10nm_edac doesn't enumerate these hidden memory controllers. However, the ADXL decodes memory errors using memory controller physical indices even if there are hidden memory controllers. Therefore, the memory controller physical indices reported by the ADXL may mismatch the logical indices enumerated by the i10nm_edac, resulting in missed error reports for some memory DIMMs. Fix this issue by creating a mapping table from memory controller physical indices (used by the ADXL) to logical indices (used by the i10nm_edac) and using it to convert the physical indices to the logical indices during the error handling process. Fixes: c545f5e41225 ("EDAC/i10nm: Skip the absent memory controllers") Reported-by: Kevin Chang <kevin1.chang@intel.com> Tested-by: Kevin Chang <kevin1.chang@intel.com> Reported-by: Thomas Chen <Thomas.Chen@intel.com> Tested-by: Thomas Chen <Thomas.Chen@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20250214002728.6287-1-qiuxu.zhuo@intel.com
2025-01-21Merge tag 'x86_cpu_for_v6.14_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpuid updates from Borislav Petkov: - Remove the less generic CPU matching infra around struct x86_cpu_desc and use the generic struct x86_cpu_id thing - Remove magic naked numbers for CPUID functions and use proper defines of the prefix CPUID_LEAF_*. Consolidate some of the crazy use around the tree - Smaller cleanups and improvements * tag 'x86_cpu_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu: Make all all CPUID leaf names consistent x86/fpu: Remove unnecessary CPUID level check x86/fpu: Move CPUID leaf definitions to common code x86/tsc: Remove CPUID "frequency" leaf magic numbers. x86/tsc: Move away from TSC leaf magic numbers x86/cpu: Move TSC CPUID leaf definition x86/cpu: Refresh DCA leaf reading code x86/cpu: Remove unnecessary MwAIT leaf checks x86/cpu: Use MWAIT leaf definition x86/cpu: Move MWAIT leaf definition to common header x86/cpu: Remove 'x86_cpu_desc' infrastructure x86/cpu: Move AMD erratum 1386 table over to 'x86_cpu_id' x86/cpu: Replace PEBS use of 'x86_cpu_desc' use with 'x86_cpu_id' x86/cpu: Expose only stepping min/max interface x86/cpu: Introduce new microcode matching helper x86/cpufeature: Document cpu_feature_enabled() as the default to use x86/paravirt: Remove the WBINVD callback x86/cpufeatures: Free up unused feature bits
2024-12-17x86/cpu: Expose only stepping min/max interfaceDave Hansen
The x86_match_cpu() infrastructure can match CPU steppings. Since there are only 16 possible steppings, the matching infrastructure goes all out and stores the stepping match as a bitmap. That means it can match any possible steppings in a single list entry. Fun. But it exposes this bitmap to each of the X86_MATCH_*() helpers when none of them really need a bitmap. It makes up for this by exporting a helper (X86_STEPPINGS()) which converts a contiguous stepping range into the bitmap which every single user leverages. Instead of a bitmap, have the main helper for this sort of thing (X86_MATCH_VFM_STEPS()) just take a stepping range. This ends up actually being even more compact than before. Leave the helper in place (renamed to __X86_STEPPINGS()) to make it more clear what is going on instead of just having a random GENMASK() in the middle of an already complicated macro. One oddity that I hit was this macro: X86_MATCH_VFM_STEPS(vfm, X86_STEPPING_MIN, max_stepping, issues) It *could* have been converted over to take a min/max stepping value for each entry. But that would have been a bit too verbose and would prevent the one oddball in the list (INTEL_COMETLAKE_L stepping 0) from sticking out. Instead, just have it take a *maximum* stepping and imply that the match is from 0=>max_stepping. This is functional for all the cases now and also retains the nice property of having INTEL_COMETLAKE_L stepping 0 stick out like a sore thumb. skx_cpuids[] is goofy. It uses the stepping match but encodes all possible steppings. Just use a normal, non-stepping match helper. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20241213185129.65527B2A%40davehans-spike.ostc.intel.com
2024-12-13EDAC/{i10nm,skx,skx_common}: Support UV systemsKyle Meyer
The 3-bit source IDs in PCI configuration space registers, used to map devices to sockets, are limited to 8 unique IDs, and each ID is local to a UPI/QPI domain. Source IDs cannot be used to map devices to sockets on UV systems because they can exceed 8 sockets and have multiple UPI/QPI domains with identical, repeating source IDs. Use NUMA information to get package IDs instead of source IDs on UV systems, and use package/source IDs to name IMC information structures. Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Link: https://lore.kernel.org/all/20241213012549.43099-1-kyle.meyer@hpe.com/
2024-12-09EDAC/i10nm: Add Intel Clearwater Forest server supportQiuxu Zhuo
Clearwater Forest is the successor to Sierra Forest. Add Clearwater Forest CPU model ID for EDAC support. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Yi Lai <yi1.lai@intel.com> Link: https://lore.kernel.org/r/20241203022038.72873-1-qiuxu.zhuo@intel.com
2024-10-23EDAC/{skx_common,i10nm}: Fix incorrect far-memory error source indicatorQiuxu Zhuo
The Granite Rapids CPUs with Flat2LM memory configurations may mistakenly report near-memory errors as far-memory errors, resulting in the invalid decoded ADXL results: EDAC skx: Bad imc -1 Fix this incorrect far-memory error source indicator by prefetching the decoded far-memory controller ID, and adjust the error source indicator to near-memory if the far-memory controller ID is invalid. Fixes: ba987eaaabf9 ("EDAC/i10nm: Add Intel Granite Rapids server support") Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Diego Garcia Rodriguez <diego.garcia.rodriguez@intel.com> Link: https://lore.kernel.org/r/20241015072236.24543-3-qiuxu.zhuo@intel.com
2024-09-03EDAC/{skx_common,i10nm}: Remove the AMAP register for determing DDR5Qiuxu Zhuo
The configuration flag 'res_config->support_ddr5 = true' sufficiently indicates DDR5 memory support for Sapphire Rapids and Granite Rapids. Additionally, the i10nm_edac driver doesn't need to use the AMAP register for setting the 'fine_grain_bank' of each DIMM. Therefore, remove the AMAP register for determining DDR5. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20240829061309.57738-1-qiuxu.zhuo@intel.com
2024-09-03EDAC/{skx_common,skx,i10nm}: Move the common debug code to skx_commonQiuxu Zhuo
Commit afdb82fd763c ("EDAC, i10nm: make skx_common.o a separate module") made skx_common.o a separate module. With skx_common.o now a separate module, move the common debug code setup_{skx,i10nm}_debug() and teardown_{skx,i10nm}_debug() in {skx,i10nm}_base.c to skx_common.c to reduce code duplication. Additionally, prefix these function names with 'skx' to maintain consistency with other names in the file. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20240829055101.56245-1-qiuxu.zhuo@intel.com
2024-05-28EDAC/i10nm: Switch to new Intel CPU model definesTony Luck
New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20240520224620.9480-36-tony.luck@intel.com
2024-02-01EDAC/i10nm: Add Intel Grand Ridge micro-server supportQiuxu Zhuo
The Grand Ridge CPU model uses similar memory controller registers with Granite Rapids server. Add Grand Ridge CPU model ID for EDAC support. Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20240129062040.60809-3-qiuxu.zhuo@intel.com
2023-08-30Merge tag 'edac_updates_for_v6.6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull intel EDAC fixes from Tony Luck: - Old igen6 driver could lose pending events during initialization - Sapphire Rapids workstations have fewer memory controllers than their bigger siblings. This confused the driver. * tag 'edac_updates_for_v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC/igen6: Fix the issue of no error events EDAC/i10nm: Skip the absent memory controllers
2023-08-09x86/cpu: Fix Crestmont uarchPeter Zijlstra
Sierra Forest and Grand Ridge are both E-core only using Crestmont micro-architecture, They fit the pre-existing naming scheme prefectly fine, adhere to it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20230807150405.757666627@infradead.org
2023-07-24EDAC/i10nm: Skip the absent memory controllersQiuxu Zhuo
Some Sapphire Rapids workstations' absent memory controllers still appear as PCIe devices that fool the i10nm_edac driver and result in "shift exponent -66 is negative" call traces from skx_get_dimm_info(). Skip the absent memory controllers to avoid the call traces. Reported-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Closes: https://lore.kernel.org/linux-edac/CAAd53p41Ku1m1rapeqb1xtD+kKuk+BaUW=dumuoF0ZO3GhFjFA@mail.gmail.com/T/#m5de16dce60a8c836ec235868c7c16e3fefad0cc2 Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Reported-by: Koba Ko <koba.ko@canonical.com> Closes: https://lore.kernel.org/linux-edac/SA1PR11MB71305B71CCCC3D9305835202892AA@SA1PR11MB7130.namprd11.prod.outlook.com/T/#t Tested-by: Koba Ko <koba.ko@canonical.com> Fixes: d4dc89d069aa ("EDAC, i10nm: Add a driver for Intel 10nm server processors") Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20230710013232.59712-1-qiuxu.zhuo@intel.com
2023-04-10EDAC/i10nm: Add Intel Sierra Forest server supportQiuxu Zhuo
The Sierra Forest CPU model uses similar memory controller registers as Granite Rapids server. Add Sierra Forest CPU model ID for EDAC support. Tested-by: Li Zhang <li4.zhang@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20230410131531.11914-1-qiuxu.zhuo@intel.com
2023-02-08EDAC/i10nm: Add driver decoder for Sapphire Rapids serverYouquan Song
Intel SDM (December 2022) vol3B 17.13.2 contains IMC MC error codes for Sapphire Rapids. Current i10nm_edac only supports firmware decoder (ACPI DSM methods) for Sapphire Rapids. So add the driver decoder (decoding DDR memory errors via extracting error information from the IMC MC error codes) for Sapphire Rapids for better decoding performance. Co-developed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2023-01-25EDAC/i10nm: Add Intel Granite Rapids server supportQiuxu Zhuo
The Granite Rapids CPU model uses similar memory controller registers as Sapphire Rapids server but with some different configurations: - Various memory controller numbers for different Granite Rapids CPUs. So detect the number of present memory controllers at run time. - Different MMIO offsets of memory controllers. - Different triples of bus/dev/fun of some PCI devices used in i10nm_edac. Add above configurations and Granite Rapids CPU model ID for EDAC support. [Tony: Fixed 2 typos s/strcture/structure/] Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20230113032802.41752-1-qiuxu.zhuo@intel.com
2023-01-25EDAC/i10nm: Make more configurations CPU model specificQiuxu Zhuo
The numbers of memory controllers per socket, channels per memory controller, DIMMs per channel and the triples of bus/device/function of PCI devices used in i10nm_edac can be CPU model specific. Add new fields to the structure res_config for above numbers and triples to make them CPU model specific. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20230113032802.41752-1-qiuxu.zhuo@intel.com
2023-01-25EDAC/i10nm: Add Intel Emerald Rapids server supportQiuxu Zhuo
The Emerald Rapids CPU model uses similar memory controller registers as Sapphire Rapids server. Add Emerald Rapids CPU model number ID for EDAC support. Tested-by: Li Zhang <li4.zhang@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20230113032802.41752-1-qiuxu.zhuo@intel.com
2022-12-12Merge branches 'edac-ghes' and 'edac-misc' into edac-updates-for-v6.2Borislav Petkov (AMD)
Combine all queued EDAC changes for submission into v6.2: * ras/edac-ghes: EDAC/igen6: Return the correct error type when not the MC owner apei/ghes: Use xchg_release() for updating new cache slot instead of cmpxchg() EDAC: Check for GHES preference in the chipset-specific EDAC drivers EDAC/ghes: Make ghes_edac a proper module EDAC/ghes: Prepare to make ghes_edac a proper module EDAC/ghes: Add a notifier for reporting memory errors efi/cper: Export several helpers for ghes_edac to use * ras/edac-misc: EDAC/i10nm: fix refcount leak in pci_get_dev_wrapper() EDAC/i5400: Fix typo in comment: vaious -> various EDAC/mc_sysfs: Increase legacy channel support to 12 MAINTAINERS: Make Mauro EDAC reviewer MAINTAINERS: Make Manivannan Sadhasivam the maintainer of qcom_edac EDAC/i5000: Mark as BROKEN Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2022-11-28EDAC/i10nm: fix refcount leak in pci_get_dev_wrapper()Yang Yingliang
As the comment of pci_get_domain_bus_and_slot() says, it returns a PCI device with refcount incremented, so it doesn't need to call an extra pci_dev_get() in pci_get_dev_wrapper(), and the PCI device needs to be put in the error path. Fixes: d4dc89d069aa ("EDAC, i10nm: Add a driver for Intel 10nm server processors") Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20221128065512.3572550-1-yangyingliang@huawei.com
2022-10-21EDAC: Check for GHES preference in the chipset-specific EDAC driversJia He
Call ghes_get_devices() to check whether ghes_edac should be used on the platform where it is preferred over the corresponding chipset-specific EDAC driver. Unlike the existing edac_get_owner() check, the ghes_get_devices() check works independent to the module_init ordering. [ bp: Massage. ] Suggested-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Jia He <justin.he@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20221010023559.69655-6-justin.he@arm.com
2022-09-23EDAC/i10nm: Print an extra register set of retry_rd_err_logQiuxu Zhuo
Sapphire Rapids server adds an extra register set for logging more retry_rd_err_log data. So add code to print the extra register set. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20220722233338.341567-1-tony.luck@intel.com
2022-09-23EDAC/i10nm: Retrieve and print retry_rd_err_log registers for HBMQiuxu Zhuo
An HBM memory channel is divided into two pseudo channels. Each pseudo channel has its own retry_rd_err_log registers. Retrieve and print retry_rd_err_log registers of the HBM pseudo channel if the memory error is from HBM. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20220722233338.341567-1-tony.luck@intel.com
2022-09-08EDAC/i10nm: Add driver decoder for Ice Lake and Tremont CPUsYouquan Song
Current i10nm_edac only supports firmware decoder (ACPI DSM methods). MCA bank registers of Ice Lake or Tremont CPUs contain the information to decode DDR memory errors. To get better decoding performance, add the driver decoder (decoding DDR memory errors via extracting error information from MCA bank registers) for Ice Lake and Tremont CPUs. Co-developed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20220901194310.115427-1-tony.luck@intel.com/
2022-01-04EDAC/i10nm: Release mdev/mbase when failing to detect HBMQiuxu Zhuo
On systems without HBM (High Bandwidth Memory) mdev/mbase are not released/unmapped. Add the code to release mdev/mbase when failing to detect HBM. [Tony: re-word commit message] Cc: <stable@vger.kernel.org> Fixes: c945088384d0 ("EDAC/i10nm: Add support for high bandwidth memory") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20211224091126.1246-1-qiuxu.zhuo@intel.com
2021-08-23EDAC/i10nm: Retrieve and print retry_rd_err_log registersYouquan Song
Retrieve and print retry_rd_err_log registers like the earlier change: commit e80634a75aba ("EDAC, skx: Retrieve and print retry_rd_err_log registers") This is a little trickier than on Skylake because of potential interference with BIOS use of the same registers. The default behavior is to ignore these registers. A module parameter retry_rd_err_log(default=0) controls the mode of operation: - 0=off : Default. - 1=bios : Linux doesn't reset any control bits, but just reports values. This is "no harm" mode, but it may miss reporting some data. - 2=linux: Linux tries to take control and resets mode bits, clears valid/UC bits after reading. This should be more reliable (especially if BIOS interference is reduced by disabling eMCA reporting mode in BIOS setup). Co-developed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20210818175701.1611513-3-tony.luck@intel.com
2021-08-23EDAC/i10nm: Fix NVDIMM detectionQiuxu Zhuo
MCDDRCFG is a per-channel register and uses bit{0,1} to indicate the NVDIMM presence on DIMM slot{0,1}. Current i10nm_edac driver wrongly uses MCDDRCFG as per-DIMM register and fails to detect the NVDIMM. Fix it by reading MCDDRCFG as per-channel register and using its bit{0,1} to check whether the NVDIMM is populated on DIMM slot{0,1}. Fixes: d4dc89d069aa ("EDAC, i10nm: Add a driver for Intel 10nm server processors") Reported-by: Fan Du <fan.du@intel.com> Tested-by: Wen Jin <wen.jin@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20210818175701.1611513-2-tony.luck@intel.com
2021-06-17EDAC/Intel: Do not load EDAC driver when running as a guestLuck, Tony
There's little to no point in loading an EDAC driver running in a guest: 1) The CPU model reported by CPUID may not represent actual h/w 2) The hypervisor likely does not pass in access to memory controller devices 3) Hypervisors generally do not pass corrected error details to guests Add a check in each of the Intel EDAC drivers for X86_FEATURE_HYPERVISOR and simply return -ENODEV in the init routine. Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20210615174419.GA1087688@agluck-desk2.amr.corp.intel.com
2021-06-17EDAC/i10nm: Add support for high bandwidth memoryQiuxu Zhuo
A future Xeon processor will include in-package HBM (high bandwidth memory). The in-package HBM memory controller shares the same architecture with the regular DDR memory controller. Add the HBM memory controller devices for EDAC support. Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20210611170123.1057025-4-tony.luck@intel.com
2021-06-17EDAC/i10nm: Add detection of memory levels for ICX/SPR serversQiuxu Zhuo
Current i10nm_edac driver is only for system configured in 1-level memory. If the system is configured in 2-level memory, the driver doesn't report the 1st level memory DIMM for the error address, even if the error occurs in the 1st level memory. Both Ice Lake servers and Sapphire Rapids servers can be configured in 2-level memory. Add detection of memory levels to i10nm_edac for the two kinds of servers so that the driver can report the 2nd level memory DIMM or the 1st level memory DIMM according to error source. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20210611170123.1057025-3-tony.luck@intel.com
2020-11-19EDAC/i10nm: Add Intel Sapphire Rapids server supportQiuxu Zhuo
The Sapphire Rapids CPU model shares the same memory controller architecture with Ice Lake server. There are some configurations different from Ice Lake server as below: - The device ID for configuration agent. - The size for per channel memory-mapped I/O. - The DDR5 memory support. So add the above configurations and the Sapphire Rapids CPU model ID for EDAC support. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2020-11-19EDAC/i10nm: Use readl() to access MMIO registersQiuxu Zhuo
Instead of raw access, use readl() to access MMIO registers of memory controller to avoid possible compiler re-ordering. Fixes: d4dc89d069aa ("EDAC, i10nm: Add a driver for Intel 10nm server processors") Cc: <stable@vger.kernel.org> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2020-06-15EDAC, {skx,i10nm}: Use CPU stepping macro to pass configurationsQiuxu Zhuo
Use the X86_MATCH_INTEL_FAM6_MODEL_STEPPINGS() macro to pass CPU stepping specific configurations to {skx,i10nm}_init(), so can delete the CPU stepping check from 10nm_init(). Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20200509010822.76331-1-qiuxu.zhuo@intel.com
2020-06-01Merge branches 'edac-i10nm' and 'edac-misc' into edac-updates-for-5.8Borislav Petkov
Signed-off-by: Borislav Petkov <bp@suse.de>
2020-05-19EDAC/skx: Use the mcmtr register to retrieve close_pg/bank_xor_enableQiuxu Zhuo
The skx_edac driver wrongly uses the mtr register to retrieve two fields close_pg and bank_xor_enable. Fix it by using the correct mcmtr register to get the two fields. Cc: <stable@vger.kernel.org> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Reported-by: Matthew Riley <mattdr@google.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20200515210146.1337-1-tony.luck@intel.com
2020-04-27EDAC/i10nm: Update driver to support different bus number config register ↵Qiuxu Zhuo
offsets The i10nm_edac driver failed to load on Ice Lake and Tremont/Jacobsville servers if their CPU stepping >= 4 and failed on Ice Lake-D servers from stepping 0. The root cause was that for Ice Lake and Tremont/Jacobsville servers with CPU stepping >=4, the offset for bus number configuration register was updated from 0xcc to 0xd0. For Ice Lake-D servers, all the steppings use the updated 0xd0 offset. Fix the issue by using the appropriate offset for bus number configuration register according to the CPU model number and stepping. Reported-by: Jerry Chen <jerry.t.chen@intel.com> Reported-and-tested-by: Jin Wen <wen.jin@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/linux-edac/20200427084022.GC11036@zn.tnic
2020-04-27EDAC, {skx,i10nm}: Make some configurations CPU model specificQiuxu Zhuo
The device ID for configuration agent PCI device and the offset for bus number configuration register can be CPU model specific. So add a new structure res_config to make them configurable and pass res_config to {skx,i10nm}_init() and skx_get_all_bus_mappings() for use. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20200427083246.GB11036@zn.tnic
2020-03-24EDAC: Convert to new X86 CPU match macrosThomas Gleixner
The new macro set has a consistent namespace and uses C99 initializers instead of the grufty C89 ones. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20200320131509.673579000@linutronix.de
2019-11-09EDAC: Replace EDAC_DIMM_PTR() macro with edac_get_dimm() functionRobert Richter
The EDAC_DIMM_PTR() macro takes 3 arguments from struct mem_ctl_info. Clean up this interface to only pass the mci struct and replace this macro with a new function edac_get_dimm(). Also introduce an edac_get_dimm_by_index() function for later use. This allows it to get a DIMM pointer only by a given index. This can be useful if the DIMM's position within the layers of the memory controller or the exact size of the layers are unknown. Small style changes made for some hunks after applying the semantic patch. Semantic patch used: @@ expression mci, a, b,c; @@ -EDAC_DIMM_PTR(mci->layers, mci->dimms, mci->n_layers, a, b, c) +edac_get_dimm(mci, a, b, c) [ bp: Touchups. ] Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Jason Baron <jbaron@akamai.com> Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Cc: Tero Kristo <t-kristo@ti.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106093239.25517-2-rrichter@marvell.com
2019-08-28x86/intel: Aggregate microserver namingPeter Zijlstra
Currently big microservers have _XEON_D while small microservers have _X, Make it uniformly: _D. for i in `git grep -l "\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*_\(X\|XEON_D\)"` do sed -i -e 's/\(\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*ATOM.*\)_X/\1_D/g' \ -e 's/\(\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*\)_XEON_D/\1_D/g' ${i} done Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: x86@kernel.org Cc: Dave Hansen <dave.hansen@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Link: https://lkml.kernel.org/r/20190827195122.677152989@infradead.org
2019-06-26EDAC, skx, i10nm: Fix source ID register offsetQiuxu Zhuo
The source ID register offset for Skylake server is 0xf0, while for Icelake server is 0xf8. Pass the correct offset to get the source ID. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com>