summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
AgeCommit message (Collapse)Author
2023-06-09drm/amdkfd: fix and enable debugging for gfx11Jonathan Kim
There are a couple of fixes required to enable gfx11 debugging. First, ADD_QUEUE.trap_en is an inappropriate place to toggle a per-process register so move it to SET_SHADER_DEBUGGER.trap_en. When ADD_QUEUE.skip_process_ctx_clear is set, MES will prioritize the SET_SHADER_DEBUGGER.trap_en setting. Second, to preserve correct save/restore priviledged wave states in coordination with the trap enablement setting, resume suspended waves early in the disable call. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09drm/amdkfd: apply trap workaround for gfx11Jonathan Kim
Due to a HW bug, waves in only half the shader arrays can enter trap. When starting a debug session, relocate all waves to the first shader array of each shader engine and mask off the 2nd shader array as unavailable. When ending a debug session, re-enable the 2nd shader array per shader engine. User CU masking per queue cannot be guaranteed to remain functional if requested during debugging (e.g. user cu mask requests only 2nd shader array as an available resource leading to zero HW resources available) nor can runtime be alerted of any of these changes during execution. Make user CU masking and debugging mutual exclusive with respect to availability. If the debugger tries to attach to a process with a user cu masked queue, return the runtime status as enabled but busy. If the debugger tries to attach and fails to reallocate queue waves to the first shader array of each shader engine, return the runtime status as enabled but with an error. In addition, like any other mutli-process debug supported devices, disable trap temporary setup per-process to avoid performance impact from setup overhead. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09drm/amdkfd: display debug capabilitiesJonathan Kim
Expose debug capabilities in the KFD topology node's HSA capabilities and debug properties flags. Ensure correct capabilities are exposed based on firmware support. Flag definitions can be referenced in uapi/linux/kfd_sysfs.h. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09drm/amdgpu: Do not access members of xcp w/o check (v2)Hawking Zhang
Not all the asic needs xcp. ensure check xcp availabity before accessing its member. v2: add missing change in kfd_topology.c Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Le Ma <le.ma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09drm/amdkfd: Move local_mem_info to kfd_nodeMukul Joshi
We need to track memory usage on a per partition basis. To do that, store the local memory information in KFD node instead of kfd device. v2: squash in fix ("amdkfd: Use mem_id to access mem_partition info") Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09drm/amdgpu: KFD graphics interop support compute partitionPhilip Yang
kfd_ioctl_get_dmabuf use the amdgpu bo xcp_id to get the gpu_id of the KFD node from the exported dmabuf_adev, and then create kfd bo on the correct adev and KFD node when importing the amdgpu bo to KFD. Remove function kfd_device_by_adev, it is not needed as it is the same result as dmabuf_adev->kfd.dev->nodes[0]->id. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09drm/amdkfd: Store drm node minor number for kfd nodesPhilip Yang
From KFD topology, application will find kfd node with the corresponding drm device node minor number, for example if partition drm node starts from /dev/dri/renderD129, then KFD node 0 with store drm node minor number 129. Application will open drm node /dev/dri/renderD129 to create amdgpu vm for kfd node 0 with the correct vm->mem_id to indicate the memory partition. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09drm/amdkfd: Move pgmap to amdgpu_kfd_dev structurePhilip Yang
VRAM pgmap resource is allocated every time when switching compute partitions because kfd_dev is re-initialized by post_partition_switch, As a result, it causes memory region resource leaking and system memory usage accounting unbalanced. pgmap resource should be allocated and registered only once when loading driver and freed when unloading driver, move it from kfd_dev to amdgpu_kfd_dev. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09drm/amdkfd: Use xcc mask for identifying xccLijo Lazar
Instead of start xcc id and number of xcc per node, use the xcc mask which is the mask of logical ids of xccs belonging to a parition. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Le Ma <le.ma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09drm/amdkfd: bind cpu and hiveless gpu to a hive if xgmi connectedJonathan Kim
If a CPU and GPU are xGMI connected but the GPU is hiveless with respect to other GPUs, create a new CPU-GPU hive using the GPU's PCI device location ID as the new hive ID to maintain fine grain memory access usage. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09drm/amdkfd: Report XGMI IOLINKs for GFXIP9.4.3Rajneesh Bhardwaj
GFXIP 9.4.3 could be in APU or carveout mode but we cannot use the xgmi.connected_to_cpu flag to identify the iolinks type. Use appropriate APU or Carveout mode based condition to report xgmi connection in kfd topology. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09drm/amdkfd: add gpu compute cores io links for gfx9.4.3Jonathan Kim
The PSP TA will only provide xGMI topology info for links between GPU sockets so links between partitions from different sockets will be hardcoded as 3 xGMI hops with 1 hops weighted as xGMI and 2 hops weighted with a new intra-socket weight to indicate the longest possible distance. If the link between a partition and the CPU is non-PCIe, then assume the CPU (CCDs) is located within the same socket as the partition and represent the link as an intra-socket weighted single hop XGMI link with memory bandwidth. Links between partitions within a single socket will be abstracted as single hop xGMI links weighted with the new intra-socket weight and will have memory bandwidth. Finally, use the unused function bits in the location ID to represent the coordinates of the compute partition within its socket. A follow on patch will resolve the requirement for GPU socket xGMI link representation sometime later. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09drm/amdkfd: Update sysfs node properties for multi XCCMukul Joshi
Update simd_count and array_count node properties to report values multiplied by number of XCCs in the partition. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Tested-by: Amber Lin <Amber.Lin@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09drm/amdkfd: Add spatial partitioning support in KFDMukul Joshi
This patch introduces multi-partition support in KFD. This patch includes: - Support for maximum 8 spatial partitions in KFD. - Initialize one HIQ per partition. - Management of VMID range depending on partition mode. - Management of doorbell aperture space between all partitions. - Each partition does its own queue management, interrupt handling, SMI event reporting. - IOMMU, if enabled with multiple partitions, will only work on first partition. - SPM is only supported on the first partition. - Currently, there is no support for resetting individual partitions. All partitions will reset together. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Tested-by: Amber Lin <Amber.Lin@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09drm/amdkfd: Introduce kfd_node struct (v5)Mukul Joshi
Introduce a new structure, kfd_node, which will now represent a compute node. kfd_node is carved out of kfd_dev structure. kfd_dev struct now will become the parent of kfd_node, and will store common resources such as doorbells, GTT sub-alloctor etc. kfd_node struct will store all resources specific to a compute node, such as device queue manager, interrupt handling etc. This is the first step in adding compute partition support in KFD. v2: introduce kfd_node struct to gc v11 (Hawking) v3: make reference to kfd_dev struct through kfd_node (Morris) v4: use kfd_node instead for kfd isr/mqd functions (Morris) v5: rebase (Alex) Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Tested-by: Amber Lin <Amber.Lin@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Morris Zhang <Shiwu.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-02-23drm/amdkfd: Make kobj_type structures constantThomas Weißschuh
Since commit ee6d3dd4ed48 ("driver core: make kobj_type constant.") the driver core allows the usage of const struct kobj_type. Take advantage of this to constify the structure definitions to prevent modification at runtime. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-03drm/amdkfd: Fix kernel warning during topology setupMukul Joshi
This patch fixes the following kernel warning seen during driver load by correctly initializing the p2plink attr before creating the sysfs file: [ +0.002865] ------------[ cut here ]------------ [ +0.002327] kobject: '(null)' (0000000056260cfb): is not initialized, yet kobject_put() is being called. [ +0.004780] WARNING: CPU: 32 PID: 1006 at lib/kobject.c:718 kobject_put+0xaa/0x1c0 [ +0.001361] Call Trace: [ +0.001234] <TASK> [ +0.001067] kfd_remove_sysfs_node_entry+0x24a/0x2d0 [amdgpu] [ +0.003147] kfd_topology_update_sysfs+0x3d/0x750 [amdgpu] [ +0.002890] kfd_topology_add_device+0xbd7/0xc70 [amdgpu] [ +0.002844] ? lock_release+0x13c/0x2e0 [ +0.001936] ? smu_cmn_send_smc_msg_with_param+0x1e8/0x2d0 [amdgpu] [ +0.003313] ? amdgpu_dpm_get_mclk+0x54/0x60 [amdgpu] [ +0.002703] kgd2kfd_device_init.cold+0x39f/0x4ed [amdgpu] [ +0.002930] amdgpu_amdkfd_device_init+0x13d/0x1f0 [amdgpu] [ +0.002944] amdgpu_device_init.cold+0x1464/0x17b4 [amdgpu] [ +0.002970] ? pci_bus_read_config_word+0x43/0x80 [ +0.002380] amdgpu_driver_load_kms+0x15/0x100 [amdgpu] [ +0.002744] amdgpu_pci_probe+0x147/0x370 [amdgpu] [ +0.002522] local_pci_probe+0x40/0x80 [ +0.001896] work_for_cpu_fn+0x10/0x20 [ +0.001892] process_one_work+0x26e/0x5a0 [ +0.002029] worker_thread+0x1fd/0x3e0 [ +0.001890] ? process_one_work+0x5a0/0x5a0 [ +0.002115] kthread+0xea/0x110 [ +0.001618] ? kthread_complete_and_exit+0x20/0x20 [ +0.002422] ret_from_fork+0x1f/0x30 [ +0.001808] </TASK> [ +0.001103] irq event stamp: 59837 [ +0.001718] hardirqs last enabled at (59849): [<ffffffffb30fab12>] __up_console_sem+0x52/0x60 [ +0.004414] hardirqs last disabled at (59860): [<ffffffffb30faaf7>] __up_console_sem+0x37/0x60 [ +0.004414] softirqs last enabled at (59654): [<ffffffffb307d9c7>] irq_exit_rcu+0xd7/0x130 [ +0.004205] softirqs last disabled at (59649): [<ffffffffb307d9c7>] irq_exit_rcu+0xd7/0x130 [ +0.004203] ---[ end trace 0000000000000000 ]--- Fixes: 0f28cca87e9a ("drm/amdkfd: Extend KFD device topology to surface peer-to-peer links") Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-29drm/amdkfd: Remove unnecessary condition in kfd_topology_add_device()Dan Carpenter
We re-arranged this code recently so "ret" is always zero at this point. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-23drm/amdkfd: Release the topology_lock in error caseFelix Kuehling
Move the topology-locked part of kfd_topology_add_device into a separate function to simlpify error handling and release the topology lock consistently. Reported-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Felix Kuehling <felix.kuehling@gmail.com> Signed-off-by: Ma Jun <Jun.Ma2@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-09drm/amdkfd: Make kfd_fill_cache_non_crat_info() as staticMa Jun
kfd_fill_cache_non_crat_info() is only used in kfd_topology.c, so make it as static. Signed-off-by: Ma Jun <Jun.Ma2@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-04drm/amdkfd: Fix the warning of array-index-out-of-boundsMa Jun
For some GPUs with more CUs, the original sibling_map[32] in struct crat_subtype_cache is not enough to save the cache information when create the VCRAT table, so skip filling the struct crat_subtype_cache info instead fill struct kfd_cache_properties directly to fix this problem. Signed-off-by: Ma Jun <Jun.Ma2@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-27drm/amdkfd: Cleanup kfd_dev structMukul Joshi
Cleanup kfd_dev struct by removing ddev and pdev as both drm_device and pci_dev can be fetched from amdgpu_device. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Tested-by: Amber Lin <Amber.Lin@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-16drm/amdkfd: potential crash in kfd_create_indirect_link_prop()Dan Carpenter
This code has two bugs. If kfd_topology_device_by_proximity_domain() failed on the first iteration through the loop then "cpu_link" is uninitialized and should not be dereferenced. The second bug is that we cannot dereference a list iterator when it points to the list head. In other words, if we exit the list_for_each_entry() loop exits without hitting a break then "cpu_link" is not a valid pointer and should not be dereferenced. Fix both of these problems by setting "cpu_link" to NULL when it is invalid and non-NULL when it is valid. That makes it easier to test for valid vs invalid. Fixes: 0f28cca87e9a ("drm/amdkfd: Extend KFD device topology to surface peer-to-peer links") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-14drm/amdkfd: fix warning when CONFIG_HSA_AMD_P2P is not setAlex Deucher
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_topology.c:1542:11: warning: variable 'i' set but not used [-Wunused-but-set-variable] Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-10drm/amdkfd: Remove field io_link_count from struct kfd_topology_deviceRamesh Errabolu
The field is redundant and does not serve any functional role Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-08drm/amdkfd: Extend KFD device topology to surface peer-to-peer linksRamesh Errabolu
Extend KFD device topology to surface peer-to-peer links among GPU devices connected over PCIe or xGMI. Enabling HSA_AMD_P2P is REQUIRED to surface peer-to-peer links. Prior to this KFD did not expose to user mode any P2P links or indirect links that go over two or more direct hops. Old versions of the Thunk used to make up their own P2P and indirect links without the information about peer-accessibility and chipset support available to the kernel mode driver. In this patch we expose P2P links in a new sysfs directory to provide more reliable P2P link information to user mode. Old versions of the Thunk will continue to work as before and ignore the new directory. This avoids conflicts between P2P links exposed by KFD and P2P links created by the Thunk itself. New versions of the Thunk will use only the P2P links provided in the new p2p_links directory, if it exists, or fall back to the old code path on older KFDs that don't expose p2p_links. Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-26drm/amdkfd: simplify cpu hive assignmentJonathan Kim
CPU hive assignment currently assumes when a GPU hive is connected_to_cpu, there is only one hive in the system. Only assign CPUs to the hive if they are explicitly directly connected to the GPU hive to get rid of the need for this assumption. It's more efficient to do this when querying IO links since other non-CRAT info has to be filled in anyways. Also, stop re-assigning the same CPU to the same GPU hive if it has already been done before. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-04drm/amdkfd: Add KFD support for soc21 v3Mukul Joshi
Add initial support for soc21 in KFD compute driver (Mukul) - Add new definition for soc21 device. - Add new file for amdgpu-kfd interface for GFX11 family. - Add new file for queue management, interrupt handling, mqd management for GFX11 family in KFD driver. - Related changes/updates for soc21 device in KFD driver. - Repurpose last 2 entries of SDMA MQD for driver use. v2: Add an optional argument into update queue operation (Mukul) v3: Switch to ip version check, replace kgd_dev with amdgpu_device (Hawking) Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Oak Zeng <Oak.Zeng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-28drm/amdkfd: Fix circular lock dependency warningMukul Joshi
[ 168.544078] ====================================================== [ 168.550309] WARNING: possible circular locking dependency detected [ 168.556523] 5.16.0-kfd-fkuehlin #148 Tainted: G E [ 168.562558] ------------------------------------------------------ [ 168.568764] kfdtest/3479 is trying to acquire lock: [ 168.573672] ffffffffc0927a70 (&topology_lock){++++}-{3:3}, at: kfd_topology_device_by_id+0x16/0x60 [amdgpu] [ 168.583663] but task is already holding lock: [ 168.589529] ffff97d303dee668 (&mm->mmap_lock#2){++++}-{3:3}, at: vm_mmap_pgoff+0xa9/0x180 [ 168.597755] which lock already depends on the new lock. [ 168.605970] the existing dependency chain (in reverse order) is: [ 168.613487] -> #3 (&mm->mmap_lock#2){++++}-{3:3}: [ 168.619700] lock_acquire+0xca/0x2e0 [ 168.623814] down_read+0x3e/0x140 [ 168.627676] do_user_addr_fault+0x40d/0x690 [ 168.632399] exc_page_fault+0x6f/0x270 [ 168.636692] asm_exc_page_fault+0x1e/0x30 [ 168.641249] filldir64+0xc8/0x1e0 [ 168.645115] call_filldir+0x7c/0x110 [ 168.649238] ext4_readdir+0x58e/0x940 [ 168.653442] iterate_dir+0x16a/0x1b0 [ 168.657558] __x64_sys_getdents64+0x83/0x140 [ 168.662375] do_syscall_64+0x35/0x80 [ 168.666492] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 168.672095] -> #2 (&type->i_mutex_dir_key#6){++++}-{3:3}: [ 168.679008] lock_acquire+0xca/0x2e0 [ 168.683122] down_read+0x3e/0x140 [ 168.686982] path_openat+0x5b2/0xa50 [ 168.691095] do_file_open_root+0xfc/0x190 [ 168.695652] file_open_root+0xd8/0x1b0 [ 168.702010] kernel_read_file_from_path_initns+0xc4/0x140 [ 168.709542] _request_firmware+0x2e9/0x5e0 [ 168.715741] request_firmware+0x32/0x50 [ 168.721667] amdgpu_cgs_get_firmware_info+0x370/0xdd0 [amdgpu] [ 168.730060] smu7_upload_smu_firmware_image+0x53/0x190 [amdgpu] [ 168.738414] fiji_start_smu+0xcf/0x4e0 [amdgpu] [ 168.745539] pp_dpm_load_fw+0x21/0x30 [amdgpu] [ 168.752503] amdgpu_pm_load_smu_firmware+0x4b/0x80 [amdgpu] [ 168.760698] amdgpu_device_fw_loading+0xb8/0x140 [amdgpu] [ 168.768412] amdgpu_device_init.cold+0xdf6/0x1716 [amdgpu] [ 168.776285] amdgpu_driver_load_kms+0x15/0x120 [amdgpu] [ 168.784034] amdgpu_pci_probe+0x19b/0x3a0 [amdgpu] [ 168.791161] local_pci_probe+0x40/0x80 [ 168.797027] work_for_cpu_fn+0x10/0x20 [ 168.802839] process_one_work+0x273/0x5b0 [ 168.808903] worker_thread+0x20f/0x3d0 [ 168.814700] kthread+0x176/0x1a0 [ 168.819968] ret_from_fork+0x1f/0x30 [ 168.825563] -> #1 (&adev->pm.mutex){+.+.}-{3:3}: [ 168.834721] lock_acquire+0xca/0x2e0 [ 168.840364] __mutex_lock+0xa2/0x930 [ 168.846020] amdgpu_dpm_get_mclk+0x37/0x60 [amdgpu] [ 168.853257] amdgpu_amdkfd_get_local_mem_info+0xba/0xe0 [amdgpu] [ 168.861547] kfd_create_vcrat_image_gpu+0x1b1/0xbb0 [amdgpu] [ 168.869478] kfd_create_crat_image_virtual+0x447/0x510 [amdgpu] [ 168.877884] kfd_topology_add_device+0x5c8/0x6f0 [amdgpu] [ 168.885556] kgd2kfd_device_init.cold+0x385/0x4c5 [amdgpu] [ 168.893347] amdgpu_amdkfd_device_init+0x138/0x180 [amdgpu] [ 168.901177] amdgpu_device_init.cold+0x141b/0x1716 [amdgpu] [ 168.909025] amdgpu_driver_load_kms+0x15/0x120 [amdgpu] [ 168.916458] amdgpu_pci_probe+0x19b/0x3a0 [amdgpu] [ 168.923442] local_pci_probe+0x40/0x80 [ 168.929249] work_for_cpu_fn+0x10/0x20 [ 168.935008] process_one_work+0x273/0x5b0 [ 168.940944] worker_thread+0x20f/0x3d0 [ 168.946623] kthread+0x176/0x1a0 [ 168.951765] ret_from_fork+0x1f/0x30 [ 168.957277] -> #0 (&topology_lock){++++}-{3:3}: [ 168.965993] check_prev_add+0x8f/0xbf0 [ 168.971613] __lock_acquire+0x1299/0x1ca0 [ 168.977485] lock_acquire+0xca/0x2e0 [ 168.982877] down_read+0x3e/0x140 [ 168.987975] kfd_topology_device_by_id+0x16/0x60 [amdgpu] [ 168.995583] kfd_device_by_id+0xa/0x20 [amdgpu] [ 169.002180] kfd_mmap+0x95/0x200 [amdgpu] [ 169.008293] mmap_region+0x337/0x5a0 [ 169.013679] do_mmap+0x3aa/0x540 [ 169.018678] vm_mmap_pgoff+0xdc/0x180 [ 169.024095] ksys_mmap_pgoff+0x186/0x1f0 [ 169.029734] do_syscall_64+0x35/0x80 [ 169.035005] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 169.041754] other info that might help us debug this: [ 169.053276] Chain exists of: &topology_lock --> &type->i_mutex_dir_key#6 --> &mm->mmap_lock#2 [ 169.068389] Possible unsafe locking scenario: [ 169.076661] CPU0 CPU1 [ 169.082383] ---- ---- [ 169.088087] lock(&mm->mmap_lock#2); [ 169.092922] lock(&type->i_mutex_dir_key#6); [ 169.100975] lock(&mm->mmap_lock#2); [ 169.108320] lock(&topology_lock); [ 169.112957] *** DEADLOCK *** This commit fixes the deadlock warning by ensuring pm.mutex is not held while holding the topology lock. For this, kfd_local_mem_info is moved into the KFD dev struct and filled during device init. This cached value can then be used instead of querying the value again and again. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-28drm/amdkfd: Fix updating IO links during device removalMukul Joshi
The logic to update the IO links when a KFD device is removed was not correct as it would miss updating the proximity domain values for some nodes where the node_from and node_to both were greater values than the proximity domain value of the KFD device being removed from topology. Fixes: 46d18d510d7831 ("drm/amdkfd: Cleanup IO links during KFD device removal") Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-13drm/amdkfd: Cleanup IO links during KFD device removalMukul Joshi
Currently, the IO-links to the device being removed from topology, are not cleared. As a result, there would be dangling links left in the KFD topology. This patch aims to fix the following: 1. Cleanup all IO links to the device being removed. 2. Ensure that node numbering in sysfs and nodes proximity domain values are consistent after the device is removed: a. Adding a device and removing a GPU device are made mutually exclusive. b. The global proximity domain counter is no longer required to be an atomic counter. A normal 32-bit counter can be used instead. 3. Update generation_count to let user-mode know that topology has changed due to device removal. CC: Shuotao Xu <shuotaoxu@microsoft.com> Reviewed-by: Shuotao Xu <shuotaoxu@microsoft.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-14drm/amdkfd: Fix leftover errors and warningsRajneesh Bhardwaj
A bunch of errors and warnings are leftover KFD over the years, attempt to fix the errors and most warnings reported by checkpatch tool. Still a few warnings remain which may be false positives so ignore them for now. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-14drm/amdkfd: update SPDX license headerRajneesh Bhardwaj
Update the SPDX License header for all the KFD files. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-01drm/amdkfd: add kfd_device_info_init functionGraham Sider
Initializes kfd->device_info given either asic_type (enum) if GFX version is less than GFX9, or GC IP version if greater. Also takes in vf and the target compiler gfx version. Uses SDMA version to determine num_sdma_queues_per_engine. Convert device_info to a non-pointer member of kfd, change references accordingly. Change unsupported asic condition to only probe f2g, move device_info initialization post-switch. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-01drm/amdkfd: replace asic_name with amdgpu_asic_nameGraham Sider
device_info->asic_name and amdgpu_asic_name[adev->asic_type] both provide asic name strings, with the only difference being casing. Remove asic_name from device_info and replace sysfs entry with lowercase amdgpu_asic_name[]. Ensures string is null-terminated so that this doesn't break if dev->node_props.name ever gets set anywhere else. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-22drm/amdkfd: Retrieve SDMA numbers from amdgpuAmber Lin
Instead of hard coding the number of sdma engines and the number of sdma_xgmi engines in the device_info table, get the number of toal SDMA instances from amdgpu. The first two engines are sdma engines and the rest are sdma-xgmi engines unless the ASIC doesn't support XGMI. v2: add kfd_ prefix to non static function names Signed-off-by: Amber Lin <Amber.Lin@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17drm/amdkfd: replace asic_family with asic_typeGraham Sider
asic_family was a duplicate of asic_type, both of type amd_asic_type. Replace all instances of device_info->asic_family with adev->asic_type and remove asic_family from device_info. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17drm/amdkfd: convert misc checks to IP version checkingGraham Sider
Switch to IP version checking instead of asic_type on various KFD version checks. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17drm/amdkfd: convert switches to IP version checkingGraham Sider
Converts KFD switch statements to use IP version checking instead of asic_type. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17drm/amdkfd: replace trivial funcs with direct accessGraham Sider
These get funcs simply return an adev field. Replace funcs/calls with direct field accesses instead. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17drm/amdkfd: replace/remove remaining kgd_dev referencesGraham Sider
Remove get_amdgpu_device and other remaining kgd_dev references aside from declaration/kfd struct entry and initialization. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17drm/amdkfd: replace kgd_dev in get amdgpu_amdkfd funcsGraham Sider
Modified definitions: - amdgpu_amdkfd_get_fw_version - amdgpu_amdkfd_get_local_mem_info - amdgpu_amdkfd_get_gpu_clock_counter - amdgpu_amdkfd_get_max_engine_clock_in_mhz - amdgpu_amdkfd_get_cu_info - amdgpu_amdkfd_get_dmabuf_info - amdgpu_amdkfd_get_vram_usage - amdgpu_amdkfd_get_hive_id - amdgpu_amdkfd_get_unique_id - amdgpu_amdkfd_get_mmio_remap_phys_addr - amdgpu_amdkfd_get_num_gws - amdgpu_amdkfd_get_asic_rev_id - amdgpu_amdkfd_get_noretry - amdgpu_amdkfd_get_xgmi_hops_count - amdgpu_amdkfd_get_xgmi_bandwidth_mbytes - amdgpu_amdkfd_get_pcie_bandwidth_mbytes Also replaces kfd_device_by_kgd with kfd_device_by_adev, now searching via adev rather than kgd. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-19drm/amdkfd: map gpu hive id to xgmi connected cpuJonathan Kim
ROCr needs to be able to identify all devices that have direct access to fine grain memory, which should include CPUs that are connected to GPUs over xGMI. The GPU hive ID can be mapped onto the CPU hive ID since the CPU is part of the hive. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-08-05drm/amdkfd: Expose GFXIP engine version to sysfsGraham Sider
Add u32 gfx_target_version field to kfd_node_properties and kfd_device_info. Populate <asic>_device_info structs accordingly and expose to sysfs. This allows eliminating device-ID-based lookup tables in user mode for future ASICs. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-23drm/amdkfd: enable cyan_skillfish KFDTao Zhou
Add KFD support for cyan_skillfish. v2: whitespace fixes (Alex) Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-23drm/amdkfd: Renaming dqm->packets to dqm->packet_mgrOak Zeng
Renaming packets to packet_mgr to reflect the real meaning of this variable. Signed-off-by: Oak Zeng <Oak.Zeng@amd.com> Acked-by: Christian Konig <christian.koenig@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-18drm/amdkfd: Set iolink non-coherent in topologyEric Huang
Fix non-coherent bit of iolink properties flag which always is 0. Signed-off-by: Eric Huang <jinhuieric.huang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-15drm/amdkfd: Disable SVM per GPU, not per processFelix Kuehling
When some GPUs don't support SVM, don't disabe it for the entire process. That would be inconsistent with the information the process got from the topology, which indicates SVM support per GPU. Instead disable SVM support only for the unsupported GPUs. This is done by checking any per-device attributes against the bitmap of supported GPUs. Also use the supported GPU bitmap to initialize access bitmaps for new SVM address ranges. Don't handle recoverable page faults from unsupported GPUs. (I don't think there will be unsupported GPUs that can generate recoverable page faults. But better safe than sorry.) Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Philip Yang <philip.yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-04drm/amdkfd: add yellow carp KFD supportAaron Liu
This patch is to add GFX10 based Yellow Carp KFD support. We will bypass IOMMU v2. Signed-off-by: Aaron Liu <aaron.liu@amd.com> Reviewed-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-19drm/amdkfd: support beige_goby KFDChengming Gui
Add KFD support for beige_goby v2: fix asic name typo v3: squash in updates (Alex) v4: squash in needs_atomics fix (Alex) Signed-off-by: Chengming Gui <Jack.Gui@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>