Age | Commit message (Collapse) | Author |
|
Add CXL RAS Features support. Features include "patrol scrub control",
"error check scrub", "perform maintenance", and "memory sparing". This
support connects the RAS Featurs to EDAC.
|
|
CXL spec 3.2 section 8.2.10.9.11.1 describes the device patrol scrub
control feature. The device patrol scrub proactively locates and makes
corrections to errors in regular cycle.
Allow specifying the number of hours within which the patrol scrub must be
completed, subject to minimum and maximum limits reported by the device.
Also allow disabling scrub allowing trade-off error rates against
performance.
Add support for patrol scrub control on CXL memory devices.
Register with the EDAC device driver, which retrieves the scrub attribute
descriptors from EDAC scrub and exposes the sysfs scrub control attributes
to userspace. For example, scrub control for the CXL memory device
"cxl_mem0" is exposed in /sys/bus/edac/devices/cxl_mem0/scrubX/.
Additionally, add support for region-based CXL memory patrol scrub control.
CXL memory regions may be interleaved across one or more CXL memory
devices. For example, region-based scrub control for "cxl_region1" is
exposed in /sys/bus/edac/devices/cxl_region1/scrubX/.
[dj: A few formatting fixes from Jonathan]
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Link: https://patch.msgid.link/20250521124749.817-4-shiju.jose@huawei.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
In preparation for code changes related to AMD Zen5 address translation
support, a number of small code refactor and cleanups are send ahead.
|
|
Broken target lists are hard to discover as the driver fails at a
later initialization stage. Add an error message for this.
Example log messages:
cxl_mem mem1: failed to find endpoint6:0000:e0:01.3 in target list of decoder1.1
cxl_port endpoint6: failed to register decoder6.0: -6
cxl_port endpoint6: probe: 0
Signed-off-by: Robert Richter <rrichter@amd.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: "Fabio M. De Francesco" <fabio.m.de.francesco@linux.intel.com>
Tested-by: Gregory Price <gourry@gourry.net>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Link: https://patch.msgid.link/20250509150700.2817697-14-rrichter@amd.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
Esp. in complex system configurations with multiple endpoints and
interleaving setups it is hard to detect region setup failures as its
registration may silently fail. Add messages to show registration
failures.
Example log message:
cxl region5: region sort successful
cxl region5: mem0:endpoint5 decoder5.0 add: mem0:decoder5.0 @ 0 next: none nr_eps: 1 nr_targets: 1
cxl_port endpoint5: decoder5.0: range: 0x22350000000-0x2634fffffff iw: 1 ig: 256
cxl region5: pci0000:e0:port1 decoder1.2 add: mem0:decoder5.0 @ 0 next: mem0 nr_eps: 1 nr_targets: 1
cxl region5: pci0000:e0:port1 iw: 1 ig: 256
cxl region5: pci0000:e0:port1: decoder1.2 expected 0000:e0:01.2 at 0
cxl endpoint5: failed to attach decoder5.0 to region5: -6
cxl_port endpoint5: probe: 0
Signed-off-by: Robert Richter <rrichter@amd.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: "Fabio M. De Francesco" <fabio.m.de.francesco@linux.intel.com>
Tested-by: Gregory Price <gourry@gourry.net>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Link: https://patch.msgid.link/20250509150700.2817697-13-rrichter@amd.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
Factor out code to find the switch decoder of a port for a specific
address range. Reuse the code to search a root decoder, create the
function cxl_port_find_switch_decoder() and rework
match_root_decoder_by_range() to be usable for switch decoders too.
Signed-off-by: Robert Richter <rrichter@amd.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: "Fabio M. De Francesco" <fabio.m.de.francesco@linux.intel.com>
Tested-by: Gregory Price <gourry@gourry.net>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Link: https://patch.msgid.link/20250509150700.2817697-12-rrichter@amd.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
In function cxl_add_to_region() there is code to determine a root
decoder's region. Factor that code out. This is in preparation to
further rework and simplify function cxl_add_to_region().
The reference count must be decremented after using the region.
cxl_find_region_by_range() is paired with the put_cxl_region cleanup
helper that can be used for this.
[dj: Fixed up "obj __free(...) = NULL" pattern]
Signed-off-by: Robert Richter <rrichter@amd.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: "Fabio M. De Francesco" <fabio.m.de.francesco@linux.intel.com>
Tested-by: Gregory Price <gourry@gourry.net>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Link: https://patch.msgid.link/20250509150700.2817697-11-rrichter@amd.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
In function cxl_add_to_region() there is code to determine the root
decoder associated to an endpoint decoder. Factor out that code for
later reuse. This has the benefit of reducing cxl_add_to_region()'s
function complexity.
The reference count must be decremented after using the root decoder.
cxl_find_root_decoder() is paired with the put_cxl_root_decoder
cleanup helper that can be used for this.
[dj: Fixed up "obj __free(...) = NULL" pattern]
Signed-off-by: Robert Richter <rrichter@amd.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: "Fabio M. De Francesco" <fabio.m.de.francesco@linux.intel.com>
Tested-by: Gregory Price <gourry@gourry.net>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Link: https://patch.msgid.link/20250509150700.2817697-10-rrichter@amd.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
When adding an endpoint to a region, the root port is determined
first. Move this directly into cxl_add_to_region(). This is in
preparation of the initialization of endpoints that iterates the port
hierarchy from the endpoint up to the root port.
As a side-effect the root argument is removed from the argument lists
of cxl_add_to_region() and related functions. Now, the endpoint is the
only parameter to add a region. This simplifies the function
interface.
Signed-off-by: Robert Richter <rrichter@amd.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: "Fabio M. De Francesco" <fabio.m.de.francesco@linux.intel.com>
Tested-by: Gregory Price <gourry@gourry.net>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Link: https://patch.msgid.link/20250509150700.2817697-8-rrichter@amd.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
Function cxl_port_pick_region_decoder() is called twice, in
alloc_region_ref() and cxl_rr_alloc_decoder(). Both functions are
subsequently called from cxl_port_attach_region(). Make the decoder a
function argument to both which avoids a duplicate call of
cxl_port_pick_region_decoder().
Now, cxl_rr_alloc_decoder() no longer allocates the decoder. Instead,
the previously picked decoder is assigned to the region reference.
Hence, rename the function to cxl_rr_assign_decoder().
Moving the call out of alloc_region_ref() also moves it out of the
xa_for_each() loop in there. Now, cxld is determined no longer only
for each auto-generated region, but now once for all regions
regardless of auto-generated or not. This is fine as the cxld argument
is needed for all regions in cxl_rr_assign_decoder() and an error would
be returned otherwise anyway. So it is better to determine the decoder
in front of all this and fail early if missing instead of running
through all that code with multiple calls of
cxl_port_pick_region_decoder().
Signed-off-by: Robert Richter <rrichter@amd.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: "Fabio M. De Francesco" <fabio.m.de.francesco@linux.intel.com>
Tested-by: Gregory Price <gourry@gourry.net>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Link: https://patch.msgid.link/20250509150700.2817697-7-rrichter@amd.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
Current function cxl_region_find_decoder() is used to find a port's
decoder during region setup. In the region creation path the function
is an allocator to find a free port. In the region assembly path, it
is recalling the decoder that platform firmware picked for validation
purposes.
Rename function to cxl_port_pick_region_decoder() that better
describes its use and update the function's description.
The result of cxl_port_pick_region_decoder() is recorded in a 'struct
cxl_region_ref' in @port for later recall when other endpoints might
also be targets of the picked decoder.
Signed-off-by: Robert Richter <rrichter@amd.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: "Fabio M. De Francesco" <fabio.m.de.francesco@linux.intel.com>
Tested-by: Gregory Price <gourry@gourry.net>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Link: https://patch.msgid.link/20250509150700.2817697-6-rrichter@amd.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
Often a parent port must be determined. Introduce the parent_port_of()
helper function to avoid open coding of determination of a parent
port.
Signed-off-by: Robert Richter <rrichter@amd.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: "Fabio M. De Francesco" <fabio.m.de.francesco@linux.intel.com>
Tested-by: Gregory Price <gourry@gourry.net>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Link: https://patch.msgid.link/20250509150700.2817697-5-rrichter@amd.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
Remove unnecessary 'else' after return. Improves readability of code.
It is easier to place comments. Check and fix all occurrences under
drivers/cxl/.
Signed-off-by: Robert Richter <rrichter@amd.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: "Fabio M. De Francesco" <fabio.m.de.francesco@linux.intel.com>
Tested-by: Gregory Price <gourry@gourry.net>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Link: https://patch.msgid.link/20250509150700.2817697-2-rrichter@amd.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
When validating decoder IW/IG when setting up regions, the granularity
is irrelevant when iw=1 - all accesses will always route to the only
target anyway - so all ig values are "correct". Loosen the requirement
that `ig = (parent_iw * parent_ig)` when iw=1.
On some Zen5 platforms, the platform BIOS specifies a 256-byte
interleave granularity window for host bridges when there is only
one target downstream. This leads to Linux rejecting the configuration
of a region with a x2 root with two x1 hostbridges.
Decoder Programming:
root - iw:2 ig:256
hb1 - iw:1 ig:256 (Linux expects 512)
hb2 - iw:1 ig:256 (Linux expects 512)
ep1 - iw:2 ig:256
ep2 - iw:2 ig:256
This change allows all decoders downstream of a passthrough decoder to
also be configured as passthrough (iw:1 ig:X), but still disallows
downstream decoders from applying subsequent interleaves.
e.g. in the above example if there was another decoder south of hb1
attempting to interleave 2 endpoints - Linux would enforce hb1.ig=512
because the southern decoder would have iw:2 and require ig=pig*piw.
[DJ: Fixed up against 6.15-rc1]
Signed-off-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Tested-by: Li Zhijian <lizhijian@fujitsu.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://patch.msgid.link/20250402232552.999634-1-gourry@gourry.net
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
In extended linear cache(ELC) case, cxl_port_get_spa_cache_alias() helps
to get the aliased address of a SPA, it considers the first address in
CXL memory range is "region start + region cache size + 1", but it
should be "region start + region cache size".
So if a SPA is equal to "region start + region cache size", its aliased
address should be "SPA - region cache size".
Signed-off-by: Li Ming <ming.li@zohomail.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Link: https://patch.msgid.link/20250317070124.815028-1-ming.li@zohomail.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
Extended Linear Cache (ELC) setup code emits a dev_warn(), "Extended
linear cache calculation failed." for issues found while setting up
the ELC.
For platforms without CONFIG_ACPI_HMAT, every auto region setup will
emit the warning because the default !ACPI_HMAT return value is
EOPNOTSUPP. Suppress it by skipping the warn for EOPNOTSUPP. Change
the EOPNOTSUPP in the actual ELC failure path to ENXIO.
Remove the check and enusing dev_warn() when region resource size is
NULL. The endpoint decoders hpa_range used to create the resource is
checked in init_hdm_decoder(), so it cannot be NULL here.
For good measure, add the rc value to the dev_warn(). It will either
be the -ENOENT returned by HMAT if the mem target is not found, or
the -ENXIO from the region driver calculation.
Reviewed-by: Li Ming <ming.li@zohomail.com>
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Link: https://patch.msgid.link/20250306213700.2606304-1-alison.schofield@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
Reported by kernel test bot from an ARM build:
drivers/cxl/core/region.c:3263:26: warning: format '%lld' expects argument of type 'long long int', but argument 3 has type 'resource_size_t' {aka 'unsigned int'} [-Wformat=]
On a 32bit system, resource_size_t is defined as 32bit long vs on a 64bit
system it is defined as 64bit long. Use %pa format to deal with
resource_size_t.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202503010252.mIDhZ5kY-lkp@intel.com/
Fixes: 0ec9849b6333 ("acpi/hmat / cxl: Add extended linear cache support for CXL")
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Link: https://patch.msgid.link/20250228204739.3849309-1-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
Add support for Extended Linear Cache for CXL. Add enumeration support
of the cache. Add MCE notification of the aliased memory address.
|
|
A series of CXL refactoring using scope based resource management to
remove goto patterns on the cleanup paths.
|
|
Some operations need to be protected by the cxl_region_rwsem in
construct_region(). Currently, construct_region() uses down_write() and
up_write() for the cxl_region_rwsem locking, so there is a goto pattern
after down_write() invoked to release cxl_region_rwsem.
construct region() can be optimized to remove the goto pattern. The
changes are creating a new function called __construct_region() which
will include all checking and operations protected by the
cxl_region_rwsem, and using guard(rwsem_write) to replace down_write()
and up_write() in __construct_region().
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Li Ming <ming.li@zohomail.com>
Link: https://patch.msgid.link/20250221013205.126419-1-ming.li@zohomail.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
In cxl_dax_region_alloc(), there is a goto pattern to release the rwsem
cxl_region_rwsem when the function returns, the down_read() and up_read
can be replaced by a guard(rwsem_read) then the goto pattern can be
removed.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Li Ming <ming.li@zohomail.com>
Link: https://patch.msgid.link/20250221012453.126366-7-ming.li@zohomail.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
Some down/up_read() and down/up_write() cases can be replaced by a
guard() simply to drop explicit unlock invoked. It helps to align coding
style with current CXL subsystem's.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Li Ming <ming.li@zohomail.com>
Link: https://patch.msgid.link/20250221012453.126366-2-ming.li@zohomail.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
Below is a setup with extended linear cache configuration with an example
layout of memory region shown below presented as a single memory region
consists of 256G memory where there's 128G of DRAM and 128G of CXL memory.
The kernel sees a region of total 256G of system memory.
128G DRAM 128G CXL memory
|-----------------------------------|-------------------------------------|
Data resides in either DRAM or far memory (FM) with no replication. Hot
data is swapped into DRAM by the hardware behind the scenes. When error is
detected in one location, it is possible that error also resides in the
aliased location. Therefore when a memory location that is flagged by MCE
is part of the special region, the aliased memory location needs to be
offlined as well.
Add an mce notify callback to identify if the MCE address location is part
of an extended linear cache region and handle accordingly.
Added symbol export to set_mce_nospec() in x86 code in order to call
set_mce_nospec() from the CXL MCE notify callback.
Link: https://lore.kernel.org/linux-cxl/668333b17e4b2_5639294fd@dwillia2-xfh.jf.intel.com.notmuch/
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Li Ming <ming.li@zohomail.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Link: https://patch.msgid.link/20250226162224.3633792-5-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
Add the aliased address of extended linear cache when emitting event
trace for poison, DRAM and general media of CXL events.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Li Ming <ming.li@zohomail.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Link: https://patch.msgid.link/20250226162224.3633792-4-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
The current cxl region size only indicates the size of the CXL memory
region without accounting for the extended linear cache size. Retrieve the
cache size from HMAT and append that to the cxl region size for the cxl
region range that matches the SRAT range that has extended linear cache
enabled.
The SRAT defines the whole memory range that includes the extended linear
cache and the CXL memory region. The new HMAT ECN/ECR to the Memory Side
Cache Information Structure defines the size of the extended linear cache
size and matches to the SRAT Memory Affinity Structure by the memory
proxmity domain. Add a helper to match the cxl range to the SRAT memory
range in order to retrieve the cache size.
There are several places that checks the cxl region range against the
decoder range. Use new helper to check between the two ranges and address
the new cache size.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Li Ming <ming.li@zohomail.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Link: https://patch.msgid.link/20250226162224.3633792-3-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
Now that the operational mode of DPA capacity (ram vs pmem... etc) is
tracked in the partition, and no code paths have dependencies on the
mode implying the partition index, the ambiguous 'enum cxl_decoder_mode'
can be cleaned up, specifically this ambiguity on whether the operation
mode implied anything about the partition order.
Endpoint decoders simply reference their assigned partition where the
operational mode can be retrieved as partition mode.
With this in place PMEM can now be partition0 which happens today when
the RAM capacity size is zero. Dynamic RAM can appear above PMEM when
DCD arrives, etc. Code sequences that hard coded the "PMEM after RAM"
assumption can now just iterate partitions and consult the partition
mode after the fact.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Alejandro Lucero <alucerop@amd.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Alejandro Lucero <alucerop@amd.com>
Link: https://patch.msgid.link/173864306972.668823.3327008645125276726.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
In preparation for consolidating all DPA partition information into an
array of DPA metadata, introduce helpers that hide the layout of the
current data. I.e. make the eventual replacement of ->ram_res,
->pmem_res, ->ram_perf, and ->pmem_perf with a new DPA metadata array a
no-op for code paths that consume that information, and reduce the noise
of follow-on patches.
The end goal is to consolidate all DPA information in 'struct
cxl_dev_state', but for now the helpers just make it appear that all DPA
metadata is relative to @cxlds.
As the conversion to generic partition metadata walking is completed,
these helpers will naturally be eliminated, or reduced in scope.
Cc: Alejandro Lucero <alucerop@amd.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Tested-by: Alejandro Lucero <alucerop@amd.com>
Link: https://patch.msgid.link/173864305238.668823.16553986866633608541.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
CXL_DECODER_MIXED is a safety mechanism introduced for the case where
platform firmware has programmed an endpoint decoder that straddles a
DPA partition boundary. While the kernel is careful to only allocate DPA
capacity within a single partition there is no guarantee that platform
firmware, or anything that touched the device before the current kernel,
gets that right.
However, __cxl_dpa_reserve() will never get to the CXL_DECODER_MIXED
designation because of the way it tracks partition boundaries. A
request_resource() that spans ->ram_res and ->pmem_res fails with the
following signature:
__cxl_dpa_reserve: cxl_port endpoint15: decoder15.0: failed to reserve allocation
CXL_DECODER_MIXED is dead defensive programming after the driver has
already given up on the device. It has never offered any protection in
practice, just delete it.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Alejandro Lucero <alucerop@amd.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Tested-by: Alejandro Lucero <alucerop@amd.com>
Link: https://patch.msgid.link/173864304660.668823.17000888505587850279.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
We need the debugfs / driver-core fixes in here as well for testing and
to build on top of.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
For API device_for_each_child_reverse_from(..., const void *data,
int (*fn)(struct device *dev, const void *data))
- Type of @data is const pointer, and means caller's data @*data is not
allowed to be modified, but that usually is not proper for such non
finding device iterating API.
- Types for both @data and @fn are not consistent with all other
for_each device iterating APIs device_for_each_child(_reverse)(),
bus_for_each_dev() and (driver|class)_for_each_device().
Correct its prototype by removing const from parameter types, then adapt
for various existing usages.
An dedicated typedef device_iter_t will be introduced as @fn() type for
various for_each device interating APIs later.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Link: https://lore.kernel.org/r/20250105-class_fix-v6-6-3a2f1768d4d4@quicinc.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Constify the following API:
struct device *device_find_child(struct device *dev, void *data,
int (*match)(struct device *dev, void *data));
To :
struct device *device_find_child(struct device *dev, const void *data,
device_match_t match);
typedef int (*device_match_t)(struct device *dev, const void *data);
with the following reasons:
- Protect caller's match data @*data which is for comparison and lookup
and the API does not actually need to modify @*data.
- Make the API's parameters (@match)() and @data have the same type as
all of other device finding APIs (bus|class|driver)_find_device().
- All kinds of existing device match functions can be directly taken
as the API's argument, they were exported by driver core.
Constify the API and adapt for various existing usages.
BTW, various subsystem changes are squashed into this commit to meet
'git bisect' requirement, and this commit has the minimal and simplest
changes to complement squashing shortcoming, and that may bring extra
code improvement.
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>
Acked-by: Uwe Kleine-König <ukleinek@kernel.org> # for drivers/pwm
Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Link: https://lore.kernel.org/r/20241224-const_dfc_done-v5-4-6623037414d4@quicinc.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
The cxl_port_setup_targets() algorithm fails to identify valid target list
ordering in the presence of 4-way and above switches resulting in
'cxl create-region' failures of the form:
$ cxl create-region -d decoder0.0 -g 1024 -s 2G -t ram -w 8 -m mem4 mem1 mem6 mem3 mem2 mem5 mem7 mem0
cxl region: create_region: region0: failed to set target7 to mem0
cxl region: cmd_create_region: created 0 regions
[kernel debug message]
check_last_peer:1213: cxl region0: pci0000:0c:port1: cannot host mem6:decoder7.0 at 2
bus_remove_device:574: bus: 'cxl': remove device region0
QEMU can create this failing topology:
ACPI0017:00 [root0]
|
HB_0 [port1]
/ \
RP_0 RP_1
| |
USP [port2] USP [port3]
/ / \ \ / / \ \
DSP DSP DSP DSP DSP DSP DSP DSP
| | | | | | | |
mem4 mem6 mem2 mem7 mem1 mem3 mem5 mem0
Pos: 0 2 4 6 1 3 5 7
HB: Host Bridge
RP: Root Port
USP: Upstream Port
DSP: Downstream Port
...with the following command steps:
$ qemu-system-x86_64 -machine q35,cxl=on,accel=tcg \
-smp cpus=8 \
-m 8G \
-hda /home/work/vm-images/centos-stream8-02.qcow2 \
-object memory-backend-ram,size=4G,id=m0 \
-object memory-backend-ram,size=4G,id=m1 \
-object memory-backend-ram,size=2G,id=cxl-mem0 \
-object memory-backend-ram,size=2G,id=cxl-mem1 \
-object memory-backend-ram,size=2G,id=cxl-mem2 \
-object memory-backend-ram,size=2G,id=cxl-mem3 \
-object memory-backend-ram,size=2G,id=cxl-mem4 \
-object memory-backend-ram,size=2G,id=cxl-mem5 \
-object memory-backend-ram,size=2G,id=cxl-mem6 \
-object memory-backend-ram,size=2G,id=cxl-mem7 \
-numa node,memdev=m0,cpus=0-3,nodeid=0 \
-numa node,memdev=m1,cpus=4-7,nodeid=1 \
-netdev user,id=net0,hostfwd=tcp::2222-:22 \
-device virtio-net-pci,netdev=net0 \
-device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
-device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \
-device cxl-rp,port=1,bus=cxl.1,id=root_port1,chassis=0,slot=1 \
-device cxl-upstream,bus=root_port0,id=us0 \
-device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \
-device cxl-type3,bus=swport0,volatile-memdev=cxl-mem0,id=cxl-vmem0 \
-device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=5 \
-device cxl-type3,bus=swport1,volatile-memdev=cxl-mem1,id=cxl-vmem1 \
-device cxl-downstream,port=2,bus=us0,id=swport2,chassis=0,slot=6 \
-device cxl-type3,bus=swport2,volatile-memdev=cxl-mem2,id=cxl-vmem2 \
-device cxl-downstream,port=3,bus=us0,id=swport3,chassis=0,slot=7 \
-device cxl-type3,bus=swport3,volatile-memdev=cxl-mem3,id=cxl-vmem3 \
-device cxl-upstream,bus=root_port1,id=us1 \
-device cxl-downstream,port=4,bus=us1,id=swport4,chassis=0,slot=8 \
-device cxl-type3,bus=swport4,volatile-memdev=cxl-mem4,id=cxl-vmem4 \
-device cxl-downstream,port=5,bus=us1,id=swport5,chassis=0,slot=9 \
-device cxl-type3,bus=swport5,volatile-memdev=cxl-mem5,id=cxl-vmem5 \
-device cxl-downstream,port=6,bus=us1,id=swport6,chassis=0,slot=10 \
-device cxl-type3,bus=swport6,volatile-memdev=cxl-mem6,id=cxl-vmem6 \
-device cxl-downstream,port=7,bus=us1,id=swport7,chassis=0,slot=11 \
-device cxl-type3,bus=swport7,volatile-memdev=cxl-mem7,id=cxl-vmem7 \
-M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=32G &
In Guest OS:
$ cxl create-region -d decoder0.0 -g 1024 -s 2G -t ram -w 8 -m mem4 mem1 mem6 mem3 mem2 mem5 mem7 mem0
Fix the method to calculate @distance by iterativeley multiplying the
number of targets per switch port. This also follows the algorithm
recommended here [1].
Fixes: 27b3f8d13830 ("cxl/region: Program target lists")
Link: http://lore.kernel.org/6538824b52349_7258329466@dwillia2-xfh.jf.intel.com.notmuch [1]
Signed-off-by: Huaisheng Ye <huaisheng.ye@intel.com>
Tested-by: Li Zhijian <lizhijian@fujitsu.com>
[djbw: add a comment explaining 'distance']
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Link: https://patch.msgid.link/173378716722.1270362.9546805175813426729.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
Clean up the existing export namespace code along the same lines of
commit 33def8498fdd ("treewide: Convert macro and uses of __section(foo)
to __section("foo")") and for the same reason, it is not desired for the
namespace argument to be a macro expansion itself.
Scripted using
git grep -l -e MODULE_IMPORT_NS -e EXPORT_SYMBOL_NS | while read file;
do
awk -i inplace '
/^#define EXPORT_SYMBOL_NS/ {
gsub(/__stringify\(ns\)/, "ns");
print;
next;
}
/^#define MODULE_IMPORT_NS/ {
gsub(/__stringify\(ns\)/, "ns");
print;
next;
}
/MODULE_IMPORT_NS/ {
$0 = gensub(/MODULE_IMPORT_NS\(([^)]*)\)/, "MODULE_IMPORT_NS(\"\\1\")", "g");
}
/EXPORT_SYMBOL_NS/ {
if ($0 ~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+),/) {
if ($0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/ &&
$0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(\)/ &&
$0 !~ /^my/) {
getline line;
gsub(/[[:space:]]*\\$/, "");
gsub(/[[:space:]]/, "", line);
$0 = $0 " " line;
}
$0 = gensub(/(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/,
"\\1(\\2, \"\\3\")", "g");
}
}
{ print }' $file;
done
Requested-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://mail.google.com/mail/u/2/#inbox/FMfcgzQXKWgMmjdFwwdsfgxzKpVHWPlc
Acked-by: Greg KH <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull cxl updates from Dave Jiang:
- Constify range_contains() input parameters to prevent changes
- Add support for displaying RCD capabilities in sysfs to support lspci
for CXL device
- Downgrade warning message to debug in cxl_probe_component_regs()
- Add support for adding a printf specifier '%pra' to emit 'struct
range' content:
- Add sanity tests for 'struct resource'
- Add documentation for special case
- Add %pra for 'struct range'
- Add %pra usage in CXL code
- Add preparation code for DCD support:
- Add range_overlaps()
- Add CDAT DSMAS table shared and read only flag in ACPICA
- Add documentation to 'struct dev_dax_range'
- Delay event buffer allocation in CXL PCI code until needed
- Use guard() in cxl_dpa_set_mode()
- Refactor create region code to consolidate common code
* tag 'cxl-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
cxl/region: Refactor common create region code
cxl/hdm: Use guard() in cxl_dpa_set_mode()
cxl/pci: Delay event buffer allocation
dax: Document struct dev_dax_range
ACPI/CDAT: Add CDAT/DSMAS shared and read only flag values
range: Add range_overlaps()
cxl/cdat: Use %pra for dpa range outputs
printf: Add print format (%pra) for struct range
Documentation/printf: struct resource add start == end special case
test printf: Add very basic struct resource tests
cxl: downgrade a warning message to debug level in cxl_probe_component_regs()
cxl/pci: Add sysfs attribute for CXL 1.1 device link status
cxl/core/regs: Add rcd_pcie_cap initialization
kernel/range: Const-ify range_contains parameters
|
|
create_pmem_region_store() and create_ram_region_store() are identical
with the exception of the region mode. With the addition of DC region
mode this would end up being 3 copies of the same code.
Refactor create_pmem_region_store() and create_ram_region_store() to use
a single common function to be used in subsequent DC code.
Suggested-by: Fan Ni <fan.ni@samsung.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Li Ming <ming4.li@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Link: https://patch.msgid.link/20241107-dcd-type2-upstream-v7-6-56a84e66bc36@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
With the recent change to allow out-of-order decoder de-commit it
highlights a need to strengthen the in-order decoder commit guarantees.
As it stands match_free_decoder() ensures that if 2 regions are racing
decoder allocations the one that wins the race will get the lower id
decoder, but that still leaves the race to *commit* the decoder.
Rather than have this complicated case of "reserved in-order, but may
still commit out-of-order", just arrange for the reservation order to
match the commit-order. In other words, prevent subsequent allocations
until the last reservation is committed.
This precludes overlapping region creation events and requires the
previous regionN to either move forward to the decoder commit stage or
drop its reservation before regionN+1 can move forward. That is,
provided that regionN and regionN+1 decode through the same switch port.
As a side effect this allows match_free_decoder() to drop its dependency
on needing write access to the device_find_child() @data parameter [1].
Reported-by: Zijun Hu <quic_zijuhu@quicinc.com>
Closes: http://lore.kernel.org/20240905-const_dfc_prepare-v4-0-4180e1d5a244@quicinc.com
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Link: https://patch.msgid.link/172964783668.81806.14962699553881333486.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
|
|
In support of investigating an initialization failure report [1],
cxl_test was updated to register mock memory-devices after the mock
root-port/bus device had been registered. That led to cxl_test crashing
with a use-after-free bug with the following signature:
cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem0:decoder7.0 @ 0 next: cxl_switch_uport.0 nr_eps: 1 nr_targets: 1
cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem4:decoder14.0 @ 1 next: cxl_switch_uport.0 nr_eps: 2 nr_targets: 1
cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[0] = cxl_switch_dport.0 for mem0:decoder7.0 @ 0
1) cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[1] = cxl_switch_dport.4 for mem4:decoder14.0 @ 1
[..]
cxld_unregister: cxl decoder14.0:
cxl_region_decode_reset: cxl_region region3:
mock_decoder_reset: cxl_port port3: decoder3.0 reset
2) mock_decoder_reset: cxl_port port3: decoder3.0: out of order reset, expected decoder3.1
cxl_endpoint_decoder_release: cxl decoder14.0:
[..]
cxld_unregister: cxl decoder7.0:
3) cxl_region_decode_reset: cxl_region region3:
Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6bc3: 0000 [#1] PREEMPT SMP PTI
[..]
RIP: 0010:to_cxl_port+0x8/0x60 [cxl_core]
[..]
Call Trace:
<TASK>
cxl_region_decode_reset+0x69/0x190 [cxl_core]
cxl_region_detach+0xe8/0x210 [cxl_core]
cxl_decoder_kill_region+0x27/0x40 [cxl_core]
cxld_unregister+0x5d/0x60 [cxl_core]
At 1) a region has been established with 2 endpoint decoders (7.0 and
14.0). Those endpoints share a common switch-decoder in the topology
(3.0). At teardown, 2), decoder14.0 is the first to be removed and hits
the "out of order reset case" in the switch decoder. The effect though
is that region3 cleanup is aborted leaving it in-tact and
referencing decoder14.0. At 3) the second attempt to teardown region3
trips over the stale decoder14.0 object which has long since been
deleted.
The fix here is to recognize that the CXL specification places no
mandate on in-order shutdown of switch-decoders, the driver enforces
in-order allocation, and hardware enforces in-order commit. So, rather
than fail and leave objects dangling, always remove them.
In support of making cxl_region_decode_reset() always succeed,
cxl_region_invalidate_memregion() failures are turned into warnings.
Crashing the kernel is ok there since system integrity is at risk if
caches cannot be managed around physical address mutation events like
CXL region destruction.
A new device_for_each_child_reverse_from() is added to cleanup
port->commit_end after all dependent decoders have been disabled. In
other words if decoders are allocated 0->1->2 and disabled 1->2->0 then
port->commit_end only decrements from 2 after 2 has been disabled, and
it decrements all the way to zero since 1 was disabled previously.
Link: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net [1]
Cc: stable@vger.kernel.org
Fixes: 176baefb2eb5 ("cxl/hdm: Commit decoder state to hardware")
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Zijun Hu <quic_zijuhu@quicinc.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Link: https://patch.msgid.link/172964782781.81806.17902885593105284330.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
|
|
The current bandwidth calculation aggregates all the targets. This simple
method does not take into account where multiple targets sharing under
a switch or a root port where the aggregated bandwidth can be greater than
the upstream link of the switch.
To accurately account for the shared upstream uplink cases, a new update
function is introduced by walking from the leaves to the root of the
hierarchy and clamp the bandwidth in the process as needed. This process
is done when all the targets for a region are present but before the
final values are send to the HMAT handling code cached access_coordinate
targets.
The original perf calculation path was kept to calculate the latency
performance data that does not require the shared link consideration.
The shared upstream link calculation is done as a second pass when all
the endpoints have arrived.
Testing is done via qemu with CXL hierarchy. run_qemu[1] is modified to
support several CXL hierarchy layouts. The following layouts are tested:
HB: Host Bridge
RP: Root Port
SW: Switch
EP: End Point
2 HB 2 RP 2 EP: resulting bandwidth: 624
1 HB 2 RP 2 EP: resulting bandwidth: 624
2 HB 2 RP 2 SW 4 EP: resulting bandwidth: 624
Current testing, perf number from SRAT/HMAT is hacked into the kernel
code. However with new QEMU support of Generic Target Port that's
incoming, the perf data injection is no longer needed.
[1]: https://github.com/pmem/run_qemu
Suggested-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://lore.kernel.org/linux-cxl/20240501152503.00002e60@Huawei.com/
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Link: https://patch.msgid.link/20240904001316.1688225-3-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
In testing Dynamic Capacity Device (DCD) support, a lockdep splat
revealed an ABBA issue between the memory notifiers and the DCD extent
processing code.[0] Changing the lock ordering within DCD proved
difficult because regions must be stable while searching for the proper
region and then the device lock must be held to properly notify the DAX
region driver of memory changes.
Dan points out in the thread that notifiers should be able to trust that
it is safe to access static data. Region data is static once the device
is realized and until it's destruction. Thus it is better to manage the
notifiers within the region driver.
Remove the need for a lock by ensuring the notifiers are active only
during the region's lifetime.
Furthermore, remove cxl_region_nid() because resource can't be NULL
while the region is stable.
Link: https://lore.kernel.org/all/66b4cf539a79b_a36e829416@iweiny-mobl.notmuch/ [0]
Cc: Ying Huang <ying.huang@intel.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Ying Huang <ying.huang@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Link: https://patch.msgid.link/20240904-fix-notifiers-v3-1-576b4e950266@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
A device_lock() and device_unlock() pair can be replaced by a cleanup
helper scoped_guard() or guard(), that can enhance code readability. In
CXL subsystem, still use device_lock() and device_unlock() pairs for cxl
port resource protection, most of them can be replaced by a
scoped_guard() or a guard() simply.
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Li Ming <ming4.li@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://patch.msgid.link/20240830013138.2256244-2-ming4.li@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull CXL updates from Dave Jiang:
"Core:
- A CXL maturity map has been added to the documentation to detail
the current state of CXL enabling.
It provides the status of the current state of various CXL features
to inform current and future contributors of where things are and
which areas need contribution.
- A notifier handler has been added in order for a newly created CXL
memory region to trigger the abstract distance metrics calculation.
This should bring parity for CXL memory to the same level vs
hotplugged DRAM for NUMA abstract distance calculation. The
abstract distance reflects relative performance used for memory
tiering handling.
- An addition for XOR math has been added to address the CXL DPA to
SPA translation.
CXL address translation did not support address interleave math
with XOR prior to this change.
Fixes:
- Fix to address race condition in the CXL memory hotplug notifier
- Add missing MODULE_DESCRIPTION() for CXL modules
- Fix incorrect vendor debug UUID define
Misc:
- A warning has been added to inform users of an unsupported
configuration when mixing CXL VH and RCH/RCD hierarchies
- The ENXIO error code has been replaced with EBUSY for inject poison
limit reached via debugfs and cxl-test support
- Moving the PCI config read in cxl_dvsec_rr_decode() to avoid
unnecessary PCI config reads
- A refactor to a common struct for DRAM and general media CXL
events"
* tag 'cxl-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
cxl/core/pci: Move reading of control register to immediately before usage
cxl: Remove defunct code calculating host bridge target positions
cxl/region: Verify target positions using the ordered target list
cxl: Restore XOR'd position bits during address translation
cxl/core: Fold cxl_trace_hpa() into cxl_dpa_to_hpa()
cxl/test: Replace ENXIO with EBUSY for inject poison limit reached
cxl/memdev: Replace ENXIO with EBUSY for inject poison limit reached
cxl/acpi: Warn on mixed CXL VH and RCH/RCD Hierarchy
cxl/core: Fix incorrect vendor debug UUID define
Documentation: CXL Maturity Map
cxl/region: Simplify cxl_region_nid()
cxl/region: Support to calculate memory tier abstract distance
cxl/region: Fix a race condition in memory hotplug notifier
cxl: add missing MODULE_DESCRIPTION() macros
cxl/events: Use a common struct for DRAM and General Media events
|
|
Series to fix XOR math for DPA to SPA translation
- Refactor and fold cxl_trace_hpa() into cxl_dpa_to_hpa()
- Complete DPA->HPA->SPA translation and correct XOR translation issue
- Add new method to verify a CXL target position
- Remove old method of CXL target position verifiation
|
|
When a root decoder is configured the interleave target list is read
from the BIOS populated CFMWS structure. Per the CXL spec 3.1 Table
9-22 the target list is in interleave order. The CXL driver populates
its decoder target list in the same order and stores it in 'struct
cxl_switch_decoder' field "@target: active ordered target list in
current decoder configuration"
Given the promise of an ordered list, the driver can stop duplicating
the work of BIOS and simply check target positions against the ordered
list during region configuration.
The simplified check against the ordered list is presented here.
A follow-on patch will remove the unused code.
For Modulo arithmetic this is not a fix, only a simplification.
For XOR arithmetic this is a fix for HB IW of 3,6,12.
Fixes: f9db85bfec0d ("cxl/acpi: Support CXL XOR Interleave Math (CXIMS)")
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://patch.msgid.link/35d08d3aba08fee0f9b86ab1cef0c25116ca8a55.1719980933.git.alison.schofield@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
When a device reports a DPA in events like poison, general_media,
and dram, the driver translates that DPA back to an HPA. Presently,
the CXL driver translation only considers the Modulo position and
will report the wrong HPA for XOR configured root decoders.
Add a helper function that restores the XOR'd bits during DPA->HPA
address translation. Plumb a root decoder callback to the new helper
when XOR interleave arithmetic is in use. For Modulo arithmetic, just
let the callback be NULL - as in no extra work required.
Upon completion of a DPA->HPA translation a couple of checks are
performed on the result. One simply confirms that the calculated
HPA is within the address range of the region. That test is useful
for both Modulo and XOR interleave arithmetic decodes.
A second check confirms that the HPA is within an expected chunk
based on the endpoints position in the region and the region
granularity. An XOR decode disrupts the Modulo pattern making the
chunk check useless.
To align the checks with the proper decode, pull the region range
check inline and use the helper to do the chunk check for Modulo
decodes only.
A cxl-test unit test is posted for upstream review here:
https://lore.kernel.org/20240624210644.495563-1-alison.schofield@intel.com/
Fixes: 28a3ae4ff66c ("cxl/trace: Add an HPA to cxl_poison trace events")
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Tested-by: Diego Garcia Rodriguez <diego.garcia.rodriguez@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://patch.msgid.link/1a1ac880d9f889bd6384e657e810431b9a0a72e5.1719980933.git.alison.schofield@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
Although cxl_trace_hpa() is used to populate TRACE EVENTs with HPA
addresses the work it performs is a DPA to HPA translation not a
trace. Tidy up this naming by moving the minimal work done in
cxl_trace_hpa() into cxl_dpa_to_hpa() and use cxl_dpa_to_hpa()
for trace event callbacks.
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Robert Richter <rrichter@amd.com>
Link: https://patch.msgid.link/452a9b0c525b774c72d9d5851515ffa928750132.1719980933.git.alison.schofield@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
The node ID of the region can be gotten via resource start address
directly. This simplifies the implementation of cxl_region_nid().
Signed-off-by: Huang Ying <ying.huang@intel.com>
Suggested-by: Alison Schofield <alison.schofield@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://patch.msgid.link/20240618084639.1419629-4-ying.huang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
An abstract distance value must be assigned by the driver that makes
the memory available to the system. It reflects relative performance
and is used to place memory nodes backed by CXL regions in the appropriate
memory tiers allowing promotion/demotion within the existing memory tiering
mechanism.
The abstract distance is calculated based on the memory access latency
and bandwidth of CXL regions.
Signed-off-by: Huang, Ying <ying.huang@intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://patch.msgid.link/20240618084639.1419629-3-ying.huang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
In the memory hotplug notifier function of the CXL region,
cxl_region_perf_attrs_callback(), the node ID is obtained by checking
the host address range of the region. However, the address range
information is not available when the region is registered in
devm_cxl_add_region(). Additionally, this information may be removed
or added under the protection of cxl_region_rwsem during runtime. If
the memory notifier is called for nodes other than that backed by the
region, a race condition may occur, potentially leading to a NULL
dereference or an invalid address range.
The race condition is addressed by checking the availability of the
address range information under the protection of cxl_region_rwsem. To
enhance code readability and use guard(), the relevant code has been
moved into a newly added function: cxl_region_nid().
Fixes: 067353a46d8c ("cxl/region: Add memory hotplug notifier for cxl region")
Signed-off-by: Huang, Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://patch.msgid.link/20240618084639.1419629-2-ying.huang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
Since interleave capability is not verified, if the interleave
capability of a target does not match the region need, committing decoder
should have failed at the device end.
In order to checkout this error as quickly as possible, driver needs
to check the interleave capability of target during attaching it to
region.
Per CXL specification r3.1(8.2.4.20.1 CXL HDM Decoder Capability Register),
bits 11 and 12 indicate the capability to establish interleaving in 3, 6,
12 and 16 ways. If these bits are not set, the target cannot be attached to
a region utilizing such interleave ways.
Additionally, bits 8 and 9 represent the capability of the bits used for
interleaving in the address, Linux tracks this in the cxl_port
interleave_mask.
Per CXL specification r3.1(8.2.4.20.13 Decoder Protection):
eIW means encoded Interleave Ways.
eIG means encoded Interleave Granularity.
in HPA:
if eIW is 0 or 8 (interleave ways: 1, 3), all the bits of HPA are used,
the interleave bits are none, the following check is ignored.
if eIW is less than 8 (interleave ways: 2, 4, 8, 16), the interleave bits
start at bit position eIG + 8 and end at eIG + eIW + 8 - 1.
if eIW is greater than 8 (interleave ways: 6, 12), the interleave bits
start at bit position eIG + 8 and end at eIG + eIW - 1.
if the interleave mask is insufficient to cover the required interleave
bits, the target cannot be attached to the region.
Fixes: 384e624bb211 ("cxl/region: Attach endpoint decoders")
Signed-off-by: Yao Xingtao <yaoxt.fnst@fujitsu.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://patch.msgid.link/20240614084755.59503-2-yaoxt.fnst@fujitsu.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
cxl_dpa_to_region() looks up a region based on a memdev and DPA.
It wrongly assumes an endpoint found mapping the DPA is also of
a fully assembled region. When not true it leads to a null pointer
dereference looking up the region name.
This appears during testing of region lookup after a failure to
assemble a BIOS defined region or if the lookup raced with the
assembly of the BIOS defined region.
Failure to clean up BIOS defined regions that fail assembly is an
issue in itself and a fix to that problem will alleviate some of
the impact. It will not alleviate the race condition so let's harden
this path.
The behavior change is that the kernel oops due to a null pointer
dereference is replaced with a dev_dbg() message noting that an
endpoint was mapped.
Additional comments are added so that future users of this function
can more clearly understand what it provides.
Fixes: 0a105ab28a4d ("cxl/memdev: Warn of poison inject or clear to a mapped region")
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://patch.msgid.link/20240604003609.202682-1-alison.schofield@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|