summaryrefslogtreecommitdiff
path: root/drivers/cxl/core/region.c
AgeCommit message (Collapse)Author
2025-05-23Merge branch 'for-6.16/cxl-features-ras' into cxl-for-nextDave Jiang
Add CXL RAS Features support. Features include "patrol scrub control", "error check scrub", "perform maintenance", and "memory sparing". This support connects the RAS Featurs to EDAC.
2025-05-23cxl/edac: Add CXL memory device patrol scrub control featureShiju Jose
CXL spec 3.2 section 8.2.10.9.11.1 describes the device patrol scrub control feature. The device patrol scrub proactively locates and makes corrections to errors in regular cycle. Allow specifying the number of hours within which the patrol scrub must be completed, subject to minimum and maximum limits reported by the device. Also allow disabling scrub allowing trade-off error rates against performance. Add support for patrol scrub control on CXL memory devices. Register with the EDAC device driver, which retrieves the scrub attribute descriptors from EDAC scrub and exposes the sysfs scrub control attributes to userspace. For example, scrub control for the CXL memory device "cxl_mem0" is exposed in /sys/bus/edac/devices/cxl_mem0/scrubX/. Additionally, add support for region-based CXL memory patrol scrub control. CXL memory regions may be interleaved across one or more CXL memory devices. For example, region-based scrub control for "cxl_region1" is exposed in /sys/bus/edac/devices/cxl_region1/scrubX/. [dj: A few formatting fixes from Jonathan] Reviewed-by: Dave Jiang <dave.jiang@intel.com> Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Link: https://patch.msgid.link/20250521124749.817-4-shiju.jose@huawei.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-05-09Merge branch 'for-6.16/cxl-cleanups' into cxl-for-nextDave Jiang
In preparation for code changes related to AMD Zen5 address translation support, a number of small code refactor and cleanups are send ahead.
2025-05-09cxl/region: Add a dev_err() on missing target list entriesRobert Richter
Broken target lists are hard to discover as the driver fails at a later initialization stage. Add an error message for this. Example log messages: cxl_mem mem1: failed to find endpoint6:0000:e0:01.3 in target list of decoder1.1 cxl_port endpoint6: failed to register decoder6.0: -6 cxl_port endpoint6: probe: 0 Signed-off-by: Robert Richter <rrichter@amd.com> Reviewed-by: Gregory Price <gourry@gourry.net> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: "Fabio M. De Francesco" <fabio.m.de.francesco@linux.intel.com> Tested-by: Gregory Price <gourry@gourry.net> Acked-by: Dan Williams <dan.j.williams@intel.com> Link: https://patch.msgid.link/20250509150700.2817697-14-rrichter@amd.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-05-09cxl/region: Add a dev_warn() on registration failureRobert Richter
Esp. in complex system configurations with multiple endpoints and interleaving setups it is hard to detect region setup failures as its registration may silently fail. Add messages to show registration failures. Example log message: cxl region5: region sort successful cxl region5: mem0:endpoint5 decoder5.0 add: mem0:decoder5.0 @ 0 next: none nr_eps: 1 nr_targets: 1 cxl_port endpoint5: decoder5.0: range: 0x22350000000-0x2634fffffff iw: 1 ig: 256 cxl region5: pci0000:e0:port1 decoder1.2 add: mem0:decoder5.0 @ 0 next: mem0 nr_eps: 1 nr_targets: 1 cxl region5: pci0000:e0:port1 iw: 1 ig: 256 cxl region5: pci0000:e0:port1: decoder1.2 expected 0000:e0:01.2 at 0 cxl endpoint5: failed to attach decoder5.0 to region5: -6 cxl_port endpoint5: probe: 0 Signed-off-by: Robert Richter <rrichter@amd.com> Reviewed-by: Gregory Price <gourry@gourry.net> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: "Fabio M. De Francesco" <fabio.m.de.francesco@linux.intel.com> Tested-by: Gregory Price <gourry@gourry.net> Acked-by: Dan Williams <dan.j.williams@intel.com> Link: https://patch.msgid.link/20250509150700.2817697-13-rrichter@amd.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-05-09cxl/region: Add function to find a port's switch decoder by rangeRobert Richter
Factor out code to find the switch decoder of a port for a specific address range. Reuse the code to search a root decoder, create the function cxl_port_find_switch_decoder() and rework match_root_decoder_by_range() to be usable for switch decoders too. Signed-off-by: Robert Richter <rrichter@amd.com> Reviewed-by: Gregory Price <gourry@gourry.net> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: "Fabio M. De Francesco" <fabio.m.de.francesco@linux.intel.com> Tested-by: Gregory Price <gourry@gourry.net> Acked-by: Dan Williams <dan.j.williams@intel.com> Link: https://patch.msgid.link/20250509150700.2817697-12-rrichter@amd.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-05-09cxl/region: Factor out code to find a root decoder's regionRobert Richter
In function cxl_add_to_region() there is code to determine a root decoder's region. Factor that code out. This is in preparation to further rework and simplify function cxl_add_to_region(). The reference count must be decremented after using the region. cxl_find_region_by_range() is paired with the put_cxl_region cleanup helper that can be used for this. [dj: Fixed up "obj __free(...) = NULL" pattern] Signed-off-by: Robert Richter <rrichter@amd.com> Reviewed-by: Gregory Price <gourry@gourry.net> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: "Fabio M. De Francesco" <fabio.m.de.francesco@linux.intel.com> Tested-by: Gregory Price <gourry@gourry.net> Acked-by: Dan Williams <dan.j.williams@intel.com> Link: https://patch.msgid.link/20250509150700.2817697-11-rrichter@amd.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-05-09cxl/region: Factor out code to find the root decoderRobert Richter
In function cxl_add_to_region() there is code to determine the root decoder associated to an endpoint decoder. Factor out that code for later reuse. This has the benefit of reducing cxl_add_to_region()'s function complexity. The reference count must be decremented after using the root decoder. cxl_find_root_decoder() is paired with the put_cxl_root_decoder cleanup helper that can be used for this. [dj: Fixed up "obj __free(...) = NULL" pattern] Signed-off-by: Robert Richter <rrichter@amd.com> Reviewed-by: Gregory Price <gourry@gourry.net> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: "Fabio M. De Francesco" <fabio.m.de.francesco@linux.intel.com> Tested-by: Gregory Price <gourry@gourry.net> Acked-by: Dan Williams <dan.j.williams@intel.com> Link: https://patch.msgid.link/20250509150700.2817697-10-rrichter@amd.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-05-09cxl/region: Move find_cxl_root() to cxl_add_to_region()Robert Richter
When adding an endpoint to a region, the root port is determined first. Move this directly into cxl_add_to_region(). This is in preparation of the initialization of endpoints that iterates the port hierarchy from the endpoint up to the root port. As a side-effect the root argument is removed from the argument lists of cxl_add_to_region() and related functions. Now, the endpoint is the only parameter to add a region. This simplifies the function interface. Signed-off-by: Robert Richter <rrichter@amd.com> Reviewed-by: Gregory Price <gourry@gourry.net> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: "Fabio M. De Francesco" <fabio.m.de.francesco@linux.intel.com> Tested-by: Gregory Price <gourry@gourry.net> Acked-by: Dan Williams <dan.j.williams@intel.com> Link: https://patch.msgid.link/20250509150700.2817697-8-rrichter@amd.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-05-09cxl/region: Avoid duplicate call of cxl_port_pick_region_decoder()Robert Richter
Function cxl_port_pick_region_decoder() is called twice, in alloc_region_ref() and cxl_rr_alloc_decoder(). Both functions are subsequently called from cxl_port_attach_region(). Make the decoder a function argument to both which avoids a duplicate call of cxl_port_pick_region_decoder(). Now, cxl_rr_alloc_decoder() no longer allocates the decoder. Instead, the previously picked decoder is assigned to the region reference. Hence, rename the function to cxl_rr_assign_decoder(). Moving the call out of alloc_region_ref() also moves it out of the xa_for_each() loop in there. Now, cxld is determined no longer only for each auto-generated region, but now once for all regions regardless of auto-generated or not. This is fine as the cxld argument is needed for all regions in cxl_rr_assign_decoder() and an error would be returned otherwise anyway. So it is better to determine the decoder in front of all this and fail early if missing instead of running through all that code with multiple calls of cxl_port_pick_region_decoder(). Signed-off-by: Robert Richter <rrichter@amd.com> Reviewed-by: Gregory Price <gourry@gourry.net> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: "Fabio M. De Francesco" <fabio.m.de.francesco@linux.intel.com> Tested-by: Gregory Price <gourry@gourry.net> Acked-by: Dan Williams <dan.j.williams@intel.com> Link: https://patch.msgid.link/20250509150700.2817697-7-rrichter@amd.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-05-09cxl/region: Rename function to cxl_port_pick_region_decoder()Robert Richter
Current function cxl_region_find_decoder() is used to find a port's decoder during region setup. In the region creation path the function is an allocator to find a free port. In the region assembly path, it is recalling the decoder that platform firmware picked for validation purposes. Rename function to cxl_port_pick_region_decoder() that better describes its use and update the function's description. The result of cxl_port_pick_region_decoder() is recorded in a 'struct cxl_region_ref' in @port for later recall when other endpoints might also be targets of the picked decoder. Signed-off-by: Robert Richter <rrichter@amd.com> Reviewed-by: Gregory Price <gourry@gourry.net> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: "Fabio M. De Francesco" <fabio.m.de.francesco@linux.intel.com> Tested-by: Gregory Price <gourry@gourry.net> Acked-by: Dan Williams <dan.j.williams@intel.com> Link: https://patch.msgid.link/20250509150700.2817697-6-rrichter@amd.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-05-09cxl: Introduce parent_port_of() helperRobert Richter
Often a parent port must be determined. Introduce the parent_port_of() helper function to avoid open coding of determination of a parent port. Signed-off-by: Robert Richter <rrichter@amd.com> Reviewed-by: Gregory Price <gourry@gourry.net> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: "Fabio M. De Francesco" <fabio.m.de.francesco@linux.intel.com> Tested-by: Gregory Price <gourry@gourry.net> Acked-by: Dan Williams <dan.j.williams@intel.com> Link: https://patch.msgid.link/20250509150700.2817697-5-rrichter@amd.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-05-09cxl: Remove else after returnRobert Richter
Remove unnecessary 'else' after return. Improves readability of code. It is easier to place comments. Check and fix all occurrences under drivers/cxl/. Signed-off-by: Robert Richter <rrichter@amd.com> Reviewed-by: Gregory Price <gourry@gourry.net> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: "Fabio M. De Francesco" <fabio.m.de.francesco@linux.intel.com> Tested-by: Gregory Price <gourry@gourry.net> Acked-by: Dan Williams <dan.j.williams@intel.com> Link: https://patch.msgid.link/20250509150700.2817697-2-rrichter@amd.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-04-28cxl: core/region - ignore interleave granularity when ways=1Gregory Price
When validating decoder IW/IG when setting up regions, the granularity is irrelevant when iw=1 - all accesses will always route to the only target anyway - so all ig values are "correct". Loosen the requirement that `ig = (parent_iw * parent_ig)` when iw=1. On some Zen5 platforms, the platform BIOS specifies a 256-byte interleave granularity window for host bridges when there is only one target downstream. This leads to Linux rejecting the configuration of a region with a x2 root with two x1 hostbridges. Decoder Programming: root - iw:2 ig:256 hb1 - iw:1 ig:256 (Linux expects 512) hb2 - iw:1 ig:256 (Linux expects 512) ep1 - iw:2 ig:256 ep2 - iw:2 ig:256 This change allows all decoders downstream of a passthrough decoder to also be configured as passthrough (iw:1 ig:X), but still disallows downstream decoders from applying subsequent interleaves. e.g. in the above example if there was another decoder south of hb1 attempting to interleave 2 endpoints - Linux would enforce hb1.ig=512 because the southern decoder would have iw:2 and require ig=pig*piw. [DJ: Fixed up against 6.15-rc1] Signed-off-by: Gregory Price <gourry@gourry.net> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Tested-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Link: https://patch.msgid.link/20250402232552.999634-1-gourry@gourry.net Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-03-20cxl/region: Fix the first aliased address miscalculationLi Ming
In extended linear cache(ELC) case, cxl_port_get_spa_cache_alias() helps to get the aliased address of a SPA, it considers the first address in CXL memory range is "region start + region cache size + 1", but it should be "region start + region cache size". So if a SPA is equal to "region start + region cache size", its aliased address should be "SPA - region cache size". Signed-off-by: Li Ming <ming.li@zohomail.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Link: https://patch.msgid.link/20250317070124.815028-1-ming.li@zohomail.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-03-14cxl/region: Quiet some dev_warn()s in extended linear cache setupAlison Schofield
Extended Linear Cache (ELC) setup code emits a dev_warn(), "Extended linear cache calculation failed." for issues found while setting up the ELC. For platforms without CONFIG_ACPI_HMAT, every auto region setup will emit the warning because the default !ACPI_HMAT return value is EOPNOTSUPP. Suppress it by skipping the warn for EOPNOTSUPP. Change the EOPNOTSUPP in the actual ELC failure path to ENXIO. Remove the check and enusing dev_warn() when region resource size is NULL. The endpoint decoders hpa_range used to create the resource is checked in init_hdm_decoder(), so it cannot be NULL here. For good measure, add the rc value to the dev_warn(). It will either be the -ENOENT returned by HMAT if the mem target is not found, or the -ENXIO from the region driver calculation. Reviewed-by: Li Ming <ming.li@zohomail.com> Signed-off-by: Alison Schofield <alison.schofield@intel.com> Link: https://patch.msgid.link/20250306213700.2606304-1-alison.schofield@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-03-14cxl: Fix warning from emitting resource_size_t as long long int on 32bit systemsDave Jiang
Reported by kernel test bot from an ARM build: drivers/cxl/core/region.c:3263:26: warning: format '%lld' expects argument of type 'long long int', but argument 3 has type 'resource_size_t' {aka 'unsigned int'} [-Wformat=] On a 32bit system, resource_size_t is defined as 32bit long vs on a 64bit system it is defined as 64bit long. Use %pa format to deal with resource_size_t. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202503010252.mIDhZ5kY-lkp@intel.com/ Fixes: 0ec9849b6333 ("acpi/hmat / cxl: Add extended linear cache support for CXL") Reviewed-by: Alison Schofield <alison.schofield@intel.com> Link: https://patch.msgid.link/20250228204739.3849309-1-dave.jiang@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-03-14Merge branch 'for-6.15/extended-linear-cache' into cxl-for-next2Dave Jiang
Add support for Extended Linear Cache for CXL. Add enumeration support of the cache. Add MCE notification of the aliased memory address.
2025-03-14Merge branch 'for-6.15/guard_cleanups' into cxl-for-next2Dave Jiang
A series of CXL refactoring using scope based resource management to remove goto patterns on the cleanup paths.
2025-03-14cxl/region: Drop goto pattern of construct_region()Li Ming
Some operations need to be protected by the cxl_region_rwsem in construct_region(). Currently, construct_region() uses down_write() and up_write() for the cxl_region_rwsem locking, so there is a goto pattern after down_write() invoked to release cxl_region_rwsem. construct region() can be optimized to remove the goto pattern. The changes are creating a new function called __construct_region() which will include all checking and operations protected by the cxl_region_rwsem, and using guard(rwsem_write) to replace down_write() and up_write() in __construct_region(). Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Li Ming <ming.li@zohomail.com> Link: https://patch.msgid.link/20250221013205.126419-1-ming.li@zohomail.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-03-14cxl/region: Drop goto pattern in cxl_dax_region_alloc()Li Ming
In cxl_dax_region_alloc(), there is a goto pattern to release the rwsem cxl_region_rwsem when the function returns, the down_read() and up_read can be replaced by a guard(rwsem_read) then the goto pattern can be removed. Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Li Ming <ming.li@zohomail.com> Link: https://patch.msgid.link/20250221012453.126366-7-ming.li@zohomail.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-03-14cxl/core: Use guard() to replace open-coded down_read/write()Li Ming
Some down/up_read() and down/up_write() cases can be replaced by a guard() simply to drop explicit unlock invoked. It helps to align coding style with current CXL subsystem's. Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Li Ming <ming.li@zohomail.com> Link: https://patch.msgid.link/20250221012453.126366-2-ming.li@zohomail.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-02-26cxl: Add mce notifier to emit aliased address for extended linear cacheDave Jiang
Below is a setup with extended linear cache configuration with an example layout of memory region shown below presented as a single memory region consists of 256G memory where there's 128G of DRAM and 128G of CXL memory. The kernel sees a region of total 256G of system memory. 128G DRAM 128G CXL memory |-----------------------------------|-------------------------------------| Data resides in either DRAM or far memory (FM) with no replication. Hot data is swapped into DRAM by the hardware behind the scenes. When error is detected in one location, it is possible that error also resides in the aliased location. Therefore when a memory location that is flagged by MCE is part of the special region, the aliased memory location needs to be offlined as well. Add an mce notify callback to identify if the MCE address location is part of an extended linear cache region and handle accordingly. Added symbol export to set_mce_nospec() in x86 code in order to call set_mce_nospec() from the CXL MCE notify callback. Link: https://lore.kernel.org/linux-cxl/668333b17e4b2_5639294fd@dwillia2-xfh.jf.intel.com.notmuch/ Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Li Ming <ming.li@zohomail.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Link: https://patch.msgid.link/20250226162224.3633792-5-dave.jiang@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-02-26cxl: Add extended linear cache address alias emission for cxl eventsDave Jiang
Add the aliased address of extended linear cache when emitting event trace for poison, DRAM and general media of CXL events. Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Li Ming <ming.li@zohomail.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Link: https://patch.msgid.link/20250226162224.3633792-4-dave.jiang@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-02-26acpi/hmat / cxl: Add extended linear cache support for CXLDave Jiang
The current cxl region size only indicates the size of the CXL memory region without accounting for the extended linear cache size. Retrieve the cache size from HMAT and append that to the cxl region size for the cxl region range that matches the SRAT range that has extended linear cache enabled. The SRAT defines the whole memory range that includes the extended linear cache and the CXL memory region. The new HMAT ECN/ECR to the Memory Side Cache Information Structure defines the size of the extended linear cache size and matches to the SRAT Memory Affinity Structure by the memory proxmity domain. Add a helper to match the cxl range to the SRAT memory range in order to retrieve the cache size. There are several places that checks the cxl region range against the decoder range. Use new helper to check between the two ranges and address the new cache size. Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Li Ming <ming.li@zohomail.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Link: https://patch.msgid.link/20250226162224.3633792-3-dave.jiang@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-02-04cxl: Kill enum cxl_decoder_modeDan Williams
Now that the operational mode of DPA capacity (ram vs pmem... etc) is tracked in the partition, and no code paths have dependencies on the mode implying the partition index, the ambiguous 'enum cxl_decoder_mode' can be cleaned up, specifically this ambiguity on whether the operation mode implied anything about the partition order. Endpoint decoders simply reference their assigned partition where the operational mode can be retrieved as partition mode. With this in place PMEM can now be partition0 which happens today when the RAM capacity size is zero. Dynamic RAM can appear above PMEM when DCD arrives, etc. Code sequences that hard coded the "PMEM after RAM" assumption can now just iterate partitions and consult the partition mode after the fact. Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Alejandro Lucero <alucerop@amd.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Tested-by: Alejandro Lucero <alucerop@amd.com> Link: https://patch.msgid.link/173864306972.668823.3327008645125276726.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-02-04cxl: Introduce to_{ram,pmem}_{res,perf}() helpersDan Williams
In preparation for consolidating all DPA partition information into an array of DPA metadata, introduce helpers that hide the layout of the current data. I.e. make the eventual replacement of ->ram_res, ->pmem_res, ->ram_perf, and ->pmem_perf with a new DPA metadata array a no-op for code paths that consume that information, and reduce the noise of follow-on patches. The end goal is to consolidate all DPA information in 'struct cxl_dev_state', but for now the helpers just make it appear that all DPA metadata is relative to @cxlds. As the conversion to generic partition metadata walking is completed, these helpers will naturally be eliminated, or reduced in scope. Cc: Alejandro Lucero <alucerop@amd.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Fan Ni <fan.ni@samsung.com> Tested-by: Alejandro Lucero <alucerop@amd.com> Link: https://patch.msgid.link/173864305238.668823.16553986866633608541.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-02-04cxl: Remove the CXL_DECODER_MIXED mistakeDan Williams
CXL_DECODER_MIXED is a safety mechanism introduced for the case where platform firmware has programmed an endpoint decoder that straddles a DPA partition boundary. While the kernel is careful to only allocate DPA capacity within a single partition there is no guarantee that platform firmware, or anything that touched the device before the current kernel, gets that right. However, __cxl_dpa_reserve() will never get to the CXL_DECODER_MIXED designation because of the way it tracks partition boundaries. A request_resource() that spans ->ram_res and ->pmem_res fails with the following signature: __cxl_dpa_reserve: cxl_port endpoint15: decoder15.0: failed to reserve allocation CXL_DECODER_MIXED is dead defensive programming after the driver has already given up on the device. It has never offered any protection in practice, just delete it. Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Alejandro Lucero <alucerop@amd.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Fan Ni <fan.ni@samsung.com> Tested-by: Alejandro Lucero <alucerop@amd.com> Link: https://patch.msgid.link/173864304660.668823.17000888505587850279.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-01-13Merge 6.13-rc7 into driver-core-nextGreg Kroah-Hartman
We need the debugfs / driver-core fixes in here as well for testing and to build on top of. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-01-10driver core: Correct API device_for_each_child_reverse_from() prototypeZijun Hu
For API device_for_each_child_reverse_from(..., const void *data, int (*fn)(struct device *dev, const void *data)) - Type of @data is const pointer, and means caller's data @*data is not allowed to be modified, but that usually is not proper for such non finding device iterating API. - Types for both @data and @fn are not consistent with all other for_each device iterating APIs device_for_each_child(_reverse)(), bus_for_each_dev() and (driver|class)_for_each_device(). Correct its prototype by removing const from parameter types, then adapt for various existing usages. An dedicated typedef device_iter_t will be introduced as @fn() type for various for_each device interating APIs later. Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com> Link: https://lore.kernel.org/r/20250105-class_fix-v6-6-3a2f1768d4d4@quicinc.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-01-03driver core: Constify API device_find_child() and adapt for various usagesZijun Hu
Constify the following API: struct device *device_find_child(struct device *dev, void *data, int (*match)(struct device *dev, void *data)); To : struct device *device_find_child(struct device *dev, const void *data, device_match_t match); typedef int (*device_match_t)(struct device *dev, const void *data); with the following reasons: - Protect caller's match data @*data which is for comparison and lookup and the API does not actually need to modify @*data. - Make the API's parameters (@match)() and @data have the same type as all of other device finding APIs (bus|class|driver)_find_device(). - All kinds of existing device match functions can be directly taken as the API's argument, they were exported by driver core. Constify the API and adapt for various existing usages. BTW, various subsystem changes are squashed into this commit to meet 'git bisect' requirement, and this commit has the minimal and simplest changes to complement squashing shortcoming, and that may bring extra code improvement. Reviewed-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Takashi Sakamoto <o-takashi@sakamocchi.jp> Acked-by: Uwe Kleine-König <ukleinek@kernel.org> # for drivers/pwm Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org> Link: https://lore.kernel.org/r/20241224-const_dfc_done-v5-4-6623037414d4@quicinc.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-10cxl/region: Fix region creation for greater than x2 switchesHuaisheng Ye
The cxl_port_setup_targets() algorithm fails to identify valid target list ordering in the presence of 4-way and above switches resulting in 'cxl create-region' failures of the form: $ cxl create-region -d decoder0.0 -g 1024 -s 2G -t ram -w 8 -m mem4 mem1 mem6 mem3 mem2 mem5 mem7 mem0 cxl region: create_region: region0: failed to set target7 to mem0 cxl region: cmd_create_region: created 0 regions [kernel debug message] check_last_peer:1213: cxl region0: pci0000:0c:port1: cannot host mem6:decoder7.0 at 2 bus_remove_device:574: bus: 'cxl': remove device region0 QEMU can create this failing topology: ACPI0017:00 [root0] | HB_0 [port1] / \ RP_0 RP_1 | | USP [port2] USP [port3] / / \ \ / / \ \ DSP DSP DSP DSP DSP DSP DSP DSP | | | | | | | | mem4 mem6 mem2 mem7 mem1 mem3 mem5 mem0 Pos: 0 2 4 6 1 3 5 7 HB: Host Bridge RP: Root Port USP: Upstream Port DSP: Downstream Port ...with the following command steps: $ qemu-system-x86_64 -machine q35,cxl=on,accel=tcg \ -smp cpus=8 \ -m 8G \ -hda /home/work/vm-images/centos-stream8-02.qcow2 \ -object memory-backend-ram,size=4G,id=m0 \ -object memory-backend-ram,size=4G,id=m1 \ -object memory-backend-ram,size=2G,id=cxl-mem0 \ -object memory-backend-ram,size=2G,id=cxl-mem1 \ -object memory-backend-ram,size=2G,id=cxl-mem2 \ -object memory-backend-ram,size=2G,id=cxl-mem3 \ -object memory-backend-ram,size=2G,id=cxl-mem4 \ -object memory-backend-ram,size=2G,id=cxl-mem5 \ -object memory-backend-ram,size=2G,id=cxl-mem6 \ -object memory-backend-ram,size=2G,id=cxl-mem7 \ -numa node,memdev=m0,cpus=0-3,nodeid=0 \ -numa node,memdev=m1,cpus=4-7,nodeid=1 \ -netdev user,id=net0,hostfwd=tcp::2222-:22 \ -device virtio-net-pci,netdev=net0 \ -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \ -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \ -device cxl-rp,port=1,bus=cxl.1,id=root_port1,chassis=0,slot=1 \ -device cxl-upstream,bus=root_port0,id=us0 \ -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \ -device cxl-type3,bus=swport0,volatile-memdev=cxl-mem0,id=cxl-vmem0 \ -device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=5 \ -device cxl-type3,bus=swport1,volatile-memdev=cxl-mem1,id=cxl-vmem1 \ -device cxl-downstream,port=2,bus=us0,id=swport2,chassis=0,slot=6 \ -device cxl-type3,bus=swport2,volatile-memdev=cxl-mem2,id=cxl-vmem2 \ -device cxl-downstream,port=3,bus=us0,id=swport3,chassis=0,slot=7 \ -device cxl-type3,bus=swport3,volatile-memdev=cxl-mem3,id=cxl-vmem3 \ -device cxl-upstream,bus=root_port1,id=us1 \ -device cxl-downstream,port=4,bus=us1,id=swport4,chassis=0,slot=8 \ -device cxl-type3,bus=swport4,volatile-memdev=cxl-mem4,id=cxl-vmem4 \ -device cxl-downstream,port=5,bus=us1,id=swport5,chassis=0,slot=9 \ -device cxl-type3,bus=swport5,volatile-memdev=cxl-mem5,id=cxl-vmem5 \ -device cxl-downstream,port=6,bus=us1,id=swport6,chassis=0,slot=10 \ -device cxl-type3,bus=swport6,volatile-memdev=cxl-mem6,id=cxl-vmem6 \ -device cxl-downstream,port=7,bus=us1,id=swport7,chassis=0,slot=11 \ -device cxl-type3,bus=swport7,volatile-memdev=cxl-mem7,id=cxl-vmem7 \ -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=32G & In Guest OS: $ cxl create-region -d decoder0.0 -g 1024 -s 2G -t ram -w 8 -m mem4 mem1 mem6 mem3 mem2 mem5 mem7 mem0 Fix the method to calculate @distance by iterativeley multiplying the number of targets per switch port. This also follows the algorithm recommended here [1]. Fixes: 27b3f8d13830 ("cxl/region: Program target lists") Link: http://lore.kernel.org/6538824b52349_7258329466@dwillia2-xfh.jf.intel.com.notmuch [1] Signed-off-by: Huaisheng Ye <huaisheng.ye@intel.com> Tested-by: Li Zhijian <lizhijian@fujitsu.com> [djbw: add a comment explaining 'distance'] Signed-off-by: Dan Williams <dan.j.williams@intel.com> Link: https://patch.msgid.link/173378716722.1270362.9546805175813426729.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2024-12-02module: Convert symbol namespace to string literalPeter Zijlstra
Clean up the existing export namespace code along the same lines of commit 33def8498fdd ("treewide: Convert macro and uses of __section(foo) to __section("foo")") and for the same reason, it is not desired for the namespace argument to be a macro expansion itself. Scripted using git grep -l -e MODULE_IMPORT_NS -e EXPORT_SYMBOL_NS | while read file; do awk -i inplace ' /^#define EXPORT_SYMBOL_NS/ { gsub(/__stringify\(ns\)/, "ns"); print; next; } /^#define MODULE_IMPORT_NS/ { gsub(/__stringify\(ns\)/, "ns"); print; next; } /MODULE_IMPORT_NS/ { $0 = gensub(/MODULE_IMPORT_NS\(([^)]*)\)/, "MODULE_IMPORT_NS(\"\\1\")", "g"); } /EXPORT_SYMBOL_NS/ { if ($0 ~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+),/) { if ($0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/ && $0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(\)/ && $0 !~ /^my/) { getline line; gsub(/[[:space:]]*\\$/, ""); gsub(/[[:space:]]/, "", line); $0 = $0 " " line; } $0 = gensub(/(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/, "\\1(\\2, \"\\3\")", "g"); } } { print }' $file; done Requested-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://mail.google.com/mail/u/2/#inbox/FMfcgzQXKWgMmjdFwwdsfgxzKpVHWPlc Acked-by: Greg KH <gregkh@linuxfoundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-11-22Merge tag 'cxl-for-6.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl Pull cxl updates from Dave Jiang: - Constify range_contains() input parameters to prevent changes - Add support for displaying RCD capabilities in sysfs to support lspci for CXL device - Downgrade warning message to debug in cxl_probe_component_regs() - Add support for adding a printf specifier '%pra' to emit 'struct range' content: - Add sanity tests for 'struct resource' - Add documentation for special case - Add %pra for 'struct range' - Add %pra usage in CXL code - Add preparation code for DCD support: - Add range_overlaps() - Add CDAT DSMAS table shared and read only flag in ACPICA - Add documentation to 'struct dev_dax_range' - Delay event buffer allocation in CXL PCI code until needed - Use guard() in cxl_dpa_set_mode() - Refactor create region code to consolidate common code * tag 'cxl-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: cxl/region: Refactor common create region code cxl/hdm: Use guard() in cxl_dpa_set_mode() cxl/pci: Delay event buffer allocation dax: Document struct dev_dax_range ACPI/CDAT: Add CDAT/DSMAS shared and read only flag values range: Add range_overlaps() cxl/cdat: Use %pra for dpa range outputs printf: Add print format (%pra) for struct range Documentation/printf: struct resource add start == end special case test printf: Add very basic struct resource tests cxl: downgrade a warning message to debug level in cxl_probe_component_regs() cxl/pci: Add sysfs attribute for CXL 1.1 device link status cxl/core/regs: Add rcd_pcie_cap initialization kernel/range: Const-ify range_contains parameters
2024-11-08cxl/region: Refactor common create region codeIra Weiny
create_pmem_region_store() and create_ram_region_store() are identical with the exception of the region mode. With the addition of DC region mode this would end up being 3 copies of the same code. Refactor create_pmem_region_store() and create_ram_region_store() to use a single common function to be used in subsequent DC code. Suggested-by: Fan Ni <fan.ni@samsung.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Fan Ni <fan.ni@samsung.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Li Ming <ming4.li@intel.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Link: https://patch.msgid.link/20241107-dcd-type2-upstream-v7-6-56a84e66bc36@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2024-10-25cxl/port: Prevent out-of-order decoder allocationDan Williams
With the recent change to allow out-of-order decoder de-commit it highlights a need to strengthen the in-order decoder commit guarantees. As it stands match_free_decoder() ensures that if 2 regions are racing decoder allocations the one that wins the race will get the lower id decoder, but that still leaves the race to *commit* the decoder. Rather than have this complicated case of "reserved in-order, but may still commit out-of-order", just arrange for the reservation order to match the commit-order. In other words, prevent subsequent allocations until the last reservation is committed. This precludes overlapping region creation events and requires the previous regionN to either move forward to the decoder commit stage or drop its reservation before regionN+1 can move forward. That is, provided that regionN and regionN+1 decode through the same switch port. As a side effect this allows match_free_decoder() to drop its dependency on needing write access to the device_find_child() @data parameter [1]. Reported-by: Zijun Hu <quic_zijuhu@quicinc.com> Closes: http://lore.kernel.org/20240905-const_dfc_prepare-v4-0-4180e1d5a244@quicinc.com Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Link: https://patch.msgid.link/172964783668.81806.14962699553881333486.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Ira Weiny <ira.weiny@intel.com>
2024-10-25cxl/port: Fix use-after-free, permit out-of-order decoder shutdownDan Williams
In support of investigating an initialization failure report [1], cxl_test was updated to register mock memory-devices after the mock root-port/bus device had been registered. That led to cxl_test crashing with a use-after-free bug with the following signature: cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem0:decoder7.0 @ 0 next: cxl_switch_uport.0 nr_eps: 1 nr_targets: 1 cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem4:decoder14.0 @ 1 next: cxl_switch_uport.0 nr_eps: 2 nr_targets: 1 cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[0] = cxl_switch_dport.0 for mem0:decoder7.0 @ 0 1) cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[1] = cxl_switch_dport.4 for mem4:decoder14.0 @ 1 [..] cxld_unregister: cxl decoder14.0: cxl_region_decode_reset: cxl_region region3: mock_decoder_reset: cxl_port port3: decoder3.0 reset 2) mock_decoder_reset: cxl_port port3: decoder3.0: out of order reset, expected decoder3.1 cxl_endpoint_decoder_release: cxl decoder14.0: [..] cxld_unregister: cxl decoder7.0: 3) cxl_region_decode_reset: cxl_region region3: Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6bc3: 0000 [#1] PREEMPT SMP PTI [..] RIP: 0010:to_cxl_port+0x8/0x60 [cxl_core] [..] Call Trace: <TASK> cxl_region_decode_reset+0x69/0x190 [cxl_core] cxl_region_detach+0xe8/0x210 [cxl_core] cxl_decoder_kill_region+0x27/0x40 [cxl_core] cxld_unregister+0x5d/0x60 [cxl_core] At 1) a region has been established with 2 endpoint decoders (7.0 and 14.0). Those endpoints share a common switch-decoder in the topology (3.0). At teardown, 2), decoder14.0 is the first to be removed and hits the "out of order reset case" in the switch decoder. The effect though is that region3 cleanup is aborted leaving it in-tact and referencing decoder14.0. At 3) the second attempt to teardown region3 trips over the stale decoder14.0 object which has long since been deleted. The fix here is to recognize that the CXL specification places no mandate on in-order shutdown of switch-decoders, the driver enforces in-order allocation, and hardware enforces in-order commit. So, rather than fail and leave objects dangling, always remove them. In support of making cxl_region_decode_reset() always succeed, cxl_region_invalidate_memregion() failures are turned into warnings. Crashing the kernel is ok there since system integrity is at risk if caches cannot be managed around physical address mutation events like CXL region destruction. A new device_for_each_child_reverse_from() is added to cleanup port->commit_end after all dependent decoders have been disabled. In other words if decoders are allocated 0->1->2 and disabled 1->2->0 then port->commit_end only decrements from 2 after 2 has been disabled, and it decrements all the way to zero since 1 was disabled previously. Link: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net [1] Cc: stable@vger.kernel.org Fixes: 176baefb2eb5 ("cxl/hdm: Commit decoder state to hardware") Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Zijun Hu <quic_zijuhu@quicinc.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Link: https://patch.msgid.link/172964782781.81806.17902885593105284330.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Ira Weiny <ira.weiny@intel.com>
2024-09-22cxl: Calculate region bandwidth of targets with shared upstream linkDave Jiang
The current bandwidth calculation aggregates all the targets. This simple method does not take into account where multiple targets sharing under a switch or a root port where the aggregated bandwidth can be greater than the upstream link of the switch. To accurately account for the shared upstream uplink cases, a new update function is introduced by walking from the leaves to the root of the hierarchy and clamp the bandwidth in the process as needed. This process is done when all the targets for a region are present but before the final values are send to the HMAT handling code cached access_coordinate targets. The original perf calculation path was kept to calculate the latency performance data that does not require the shared link consideration. The shared upstream link calculation is done as a second pass when all the endpoints have arrived. Testing is done via qemu with CXL hierarchy. run_qemu[1] is modified to support several CXL hierarchy layouts. The following layouts are tested: HB: Host Bridge RP: Root Port SW: Switch EP: End Point 2 HB 2 RP 2 EP: resulting bandwidth: 624 1 HB 2 RP 2 EP: resulting bandwidth: 624 2 HB 2 RP 2 SW 4 EP: resulting bandwidth: 624 Current testing, perf number from SRAT/HMAT is hacked into the kernel code. However with new QEMU support of Generic Target Port that's incoming, the perf data injection is no longer needed. [1]: https://github.com/pmem/run_qemu Suggested-by: Jonathan Cameron <jonathan.cameron@huawei.com> Link: https://lore.kernel.org/linux-cxl/20240501152503.00002e60@Huawei.com/ Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Link: https://patch.msgid.link/20240904001316.1688225-3-dave.jiang@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2024-09-09cxl/region: Remove lock from memory notifier callbackIra Weiny
In testing Dynamic Capacity Device (DCD) support, a lockdep splat revealed an ABBA issue between the memory notifiers and the DCD extent processing code.[0] Changing the lock ordering within DCD proved difficult because regions must be stable while searching for the proper region and then the device lock must be held to properly notify the DAX region driver of memory changes. Dan points out in the thread that notifiers should be able to trust that it is safe to access static data. Region data is static once the device is realized and until it's destruction. Thus it is better to manage the notifiers within the region driver. Remove the need for a lock by ensuring the notifiers are active only during the region's lifetime. Furthermore, remove cxl_region_nid() because resource can't be NULL while the region is stable. Link: https://lore.kernel.org/all/66b4cf539a79b_a36e829416@iweiny-mobl.notmuch/ [0] Cc: Ying Huang <ying.huang@intel.com> Suggested-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Ying Huang <ying.huang@intel.com> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Link: https://patch.msgid.link/20240904-fix-notifiers-v3-1-576b4e950266@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2024-09-03cxl/port: Use scoped_guard()/guard() to drop device_lock() for cxl_portLi Ming
A device_lock() and device_unlock() pair can be replaced by a cleanup helper scoped_guard() or guard(), that can enhance code readability. In CXL subsystem, still use device_lock() and device_unlock() pairs for cxl port resource protection, most of them can be replaced by a scoped_guard() or a guard() simply. Suggested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Li Ming <ming4.li@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://patch.msgid.link/20240830013138.2256244-2-ming4.li@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2024-07-28Merge tag 'cxl-for-6.11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl Pull CXL updates from Dave Jiang: "Core: - A CXL maturity map has been added to the documentation to detail the current state of CXL enabling. It provides the status of the current state of various CXL features to inform current and future contributors of where things are and which areas need contribution. - A notifier handler has been added in order for a newly created CXL memory region to trigger the abstract distance metrics calculation. This should bring parity for CXL memory to the same level vs hotplugged DRAM for NUMA abstract distance calculation. The abstract distance reflects relative performance used for memory tiering handling. - An addition for XOR math has been added to address the CXL DPA to SPA translation. CXL address translation did not support address interleave math with XOR prior to this change. Fixes: - Fix to address race condition in the CXL memory hotplug notifier - Add missing MODULE_DESCRIPTION() for CXL modules - Fix incorrect vendor debug UUID define Misc: - A warning has been added to inform users of an unsupported configuration when mixing CXL VH and RCH/RCD hierarchies - The ENXIO error code has been replaced with EBUSY for inject poison limit reached via debugfs and cxl-test support - Moving the PCI config read in cxl_dvsec_rr_decode() to avoid unnecessary PCI config reads - A refactor to a common struct for DRAM and general media CXL events" * tag 'cxl-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: cxl/core/pci: Move reading of control register to immediately before usage cxl: Remove defunct code calculating host bridge target positions cxl/region: Verify target positions using the ordered target list cxl: Restore XOR'd position bits during address translation cxl/core: Fold cxl_trace_hpa() into cxl_dpa_to_hpa() cxl/test: Replace ENXIO with EBUSY for inject poison limit reached cxl/memdev: Replace ENXIO with EBUSY for inject poison limit reached cxl/acpi: Warn on mixed CXL VH and RCH/RCD Hierarchy cxl/core: Fix incorrect vendor debug UUID define Documentation: CXL Maturity Map cxl/region: Simplify cxl_region_nid() cxl/region: Support to calculate memory tier abstract distance cxl/region: Fix a race condition in memory hotplug notifier cxl: add missing MODULE_DESCRIPTION() macros cxl/events: Use a common struct for DRAM and General Media events
2024-07-11Merge branch 'for-6.11/xor_fixes' into cxl-for-nextDave Jiang
Series to fix XOR math for DPA to SPA translation - Refactor and fold cxl_trace_hpa() into cxl_dpa_to_hpa() - Complete DPA->HPA->SPA translation and correct XOR translation issue - Add new method to verify a CXL target position - Remove old method of CXL target position verifiation
2024-07-11cxl/region: Verify target positions using the ordered target listAlison Schofield
When a root decoder is configured the interleave target list is read from the BIOS populated CFMWS structure. Per the CXL spec 3.1 Table 9-22 the target list is in interleave order. The CXL driver populates its decoder target list in the same order and stores it in 'struct cxl_switch_decoder' field "@target: active ordered target list in current decoder configuration" Given the promise of an ordered list, the driver can stop duplicating the work of BIOS and simply check target positions against the ordered list during region configuration. The simplified check against the ordered list is presented here. A follow-on patch will remove the unused code. For Modulo arithmetic this is not a fix, only a simplification. For XOR arithmetic this is a fix for HB IW of 3,6,12. Fixes: f9db85bfec0d ("cxl/acpi: Support CXL XOR Interleave Math (CXIMS)") Signed-off-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://patch.msgid.link/35d08d3aba08fee0f9b86ab1cef0c25116ca8a55.1719980933.git.alison.schofield@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2024-07-11cxl: Restore XOR'd position bits during address translationAlison Schofield
When a device reports a DPA in events like poison, general_media, and dram, the driver translates that DPA back to an HPA. Presently, the CXL driver translation only considers the Modulo position and will report the wrong HPA for XOR configured root decoders. Add a helper function that restores the XOR'd bits during DPA->HPA address translation. Plumb a root decoder callback to the new helper when XOR interleave arithmetic is in use. For Modulo arithmetic, just let the callback be NULL - as in no extra work required. Upon completion of a DPA->HPA translation a couple of checks are performed on the result. One simply confirms that the calculated HPA is within the address range of the region. That test is useful for both Modulo and XOR interleave arithmetic decodes. A second check confirms that the HPA is within an expected chunk based on the endpoints position in the region and the region granularity. An XOR decode disrupts the Modulo pattern making the chunk check useless. To align the checks with the proper decode, pull the region range check inline and use the helper to do the chunk check for Modulo decodes only. A cxl-test unit test is posted for upstream review here: https://lore.kernel.org/20240624210644.495563-1-alison.schofield@intel.com/ Fixes: 28a3ae4ff66c ("cxl/trace: Add an HPA to cxl_poison trace events") Signed-off-by: Alison Schofield <alison.schofield@intel.com> Tested-by: Diego Garcia Rodriguez <diego.garcia.rodriguez@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Link: https://patch.msgid.link/1a1ac880d9f889bd6384e657e810431b9a0a72e5.1719980933.git.alison.schofield@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2024-07-11cxl/core: Fold cxl_trace_hpa() into cxl_dpa_to_hpa()Alison Schofield
Although cxl_trace_hpa() is used to populate TRACE EVENTs with HPA addresses the work it performs is a DPA to HPA translation not a trace. Tidy up this naming by moving the minimal work done in cxl_trace_hpa() into cxl_dpa_to_hpa() and use cxl_dpa_to_hpa() for trace event callbacks. Suggested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Robert Richter <rrichter@amd.com> Link: https://patch.msgid.link/452a9b0c525b774c72d9d5851515ffa928750132.1719980933.git.alison.schofield@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2024-07-02cxl/region: Simplify cxl_region_nid()Huang Ying
The node ID of the region can be gotten via resource start address directly. This simplifies the implementation of cxl_region_nid(). Signed-off-by: Huang Ying <ying.huang@intel.com> Suggested-by: Alison Schofield <alison.schofield@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Bharata B Rao <bharata@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://patch.msgid.link/20240618084639.1419629-4-ying.huang@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2024-07-02cxl/region: Support to calculate memory tier abstract distanceHuang Ying
An abstract distance value must be assigned by the driver that makes the memory available to the system. It reflects relative performance and is used to place memory nodes backed by CXL regions in the appropriate memory tiers allowing promotion/demotion within the existing memory tiering mechanism. The abstract distance is calculated based on the memory access latency and bandwidth of CXL regions. Signed-off-by: Huang, Ying <ying.huang@intel.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Bharata B Rao <bharata@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://patch.msgid.link/20240618084639.1419629-3-ying.huang@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2024-07-02cxl/region: Fix a race condition in memory hotplug notifierHuang Ying
In the memory hotplug notifier function of the CXL region, cxl_region_perf_attrs_callback(), the node ID is obtained by checking the host address range of the region. However, the address range information is not available when the region is registered in devm_cxl_add_region(). Additionally, this information may be removed or added under the protection of cxl_region_rwsem during runtime. If the memory notifier is called for nodes other than that backed by the region, a race condition may occur, potentially leading to a NULL dereference or an invalid address range. The race condition is addressed by checking the availability of the address range information under the protection of cxl_region_rwsem. To enhance code readability and use guard(), the relevant code has been moved into a newly added function: cxl_region_nid(). Fixes: 067353a46d8c ("cxl/region: Add memory hotplug notifier for cxl region") Signed-off-by: Huang, Ying <ying.huang@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Bharata B Rao <bharata@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://patch.msgid.link/20240618084639.1419629-2-ying.huang@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2024-06-25cxl/region: check interleave capabilityYao Xingtao
Since interleave capability is not verified, if the interleave capability of a target does not match the region need, committing decoder should have failed at the device end. In order to checkout this error as quickly as possible, driver needs to check the interleave capability of target during attaching it to region. Per CXL specification r3.1(8.2.4.20.1 CXL HDM Decoder Capability Register), bits 11 and 12 indicate the capability to establish interleaving in 3, 6, 12 and 16 ways. If these bits are not set, the target cannot be attached to a region utilizing such interleave ways. Additionally, bits 8 and 9 represent the capability of the bits used for interleaving in the address, Linux tracks this in the cxl_port interleave_mask. Per CXL specification r3.1(8.2.4.20.13 Decoder Protection): eIW means encoded Interleave Ways. eIG means encoded Interleave Granularity. in HPA: if eIW is 0 or 8 (interleave ways: 1, 3), all the bits of HPA are used, the interleave bits are none, the following check is ignored. if eIW is less than 8 (interleave ways: 2, 4, 8, 16), the interleave bits start at bit position eIG + 8 and end at eIG + eIW + 8 - 1. if eIW is greater than 8 (interleave ways: 6, 12), the interleave bits start at bit position eIG + 8 and end at eIG + eIW - 1. if the interleave mask is insufficient to cover the required interleave bits, the target cannot be attached to the region. Fixes: 384e624bb211 ("cxl/region: Attach endpoint decoders") Signed-off-by: Yao Xingtao <yaoxt.fnst@fujitsu.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://patch.msgid.link/20240614084755.59503-2-yaoxt.fnst@fujitsu.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2024-06-25cxl/region: Avoid null pointer dereference in region lookupAlison Schofield
cxl_dpa_to_region() looks up a region based on a memdev and DPA. It wrongly assumes an endpoint found mapping the DPA is also of a fully assembled region. When not true it leads to a null pointer dereference looking up the region name. This appears during testing of region lookup after a failure to assemble a BIOS defined region or if the lookup raced with the assembly of the BIOS defined region. Failure to clean up BIOS defined regions that fail assembly is an issue in itself and a fix to that problem will alleviate some of the impact. It will not alleviate the race condition so let's harden this path. The behavior change is that the kernel oops due to a null pointer dereference is replaced with a dev_dbg() message noting that an endpoint was mapped. Additional comments are added so that future users of this function can more clearly understand what it provides. Fixes: 0a105ab28a4d ("cxl/memdev: Warn of poison inject or clear to a mapped region") Signed-off-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://patch.msgid.link/20240604003609.202682-1-alison.schofield@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>