path: root/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
Age | Commit message | Author
2023-06-09 | amd/amdkfd: drop unused KFD_IOCTL_SVM_FLAG_UNCACHED flag | Alex Deucher
Was leftover from GC 9.4.3 bring up and is currently unused. Drop it for now. Cc: Philip.Yang@amd.com Cc: rajneesh.bhardwaj@amd.com Cc: Felix.Kuehling@amd.com Reviewed-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 | drm/amdgpu: Add function parameter 'event' to kdoc in svm_range_evict() | Srinivasan Shanmugam
Fixes the following gcc with W=1: drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c:1841: warning: Function parameter or member 'event' not described in 'svm_range_evict' Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 | drm/amdkfd: flag added to handle errors from svm validate and map | Alex Sierra
If a return error is raised during validation and mapping of a prange, this flag is set. It is a rare occurrence, but it could happen when `amdgpu_hmm_range_get_pages_done` returns true. In such cases, the caller should retry. However, it is important to ensure that the prange is updated correctly during the retry. Signed-off-by: Alex Sierra <alex.sierra@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 | drm/amdkfd: fix stack size in svm_range_validate_and_map | Alex Deucher
Allocate large local variable on heap to avoid exceeding the stack size: drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c: In function ‘svm_range_validate_and_map’: drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c:1690:1: warning: the frame size of 2360 bytes is larger than 2048 bytes [-Wframe-larger-than=] Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
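The general pattern behind this kind of fix can be sketched in user-space C. This is only an illustration: `struct validate_ctx`, `MAX_GPU_INSTANCES`, and `validate_and_map()` are hypothetical names, and `calloc()`/`free()` stand in for whatever allocator the kernel code actually uses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_GPU_INSTANCES 64 /* hypothetical size, for illustration only */

/* Hypothetical context; at over 2048 bytes it is too large for a small stack budget. */
struct validate_ctx {
    uint64_t dma_addrs[MAX_GPU_INSTANCES][4];
    uint8_t bitmap[MAX_GPU_INSTANCES / 8];
};

/* Returns 0 on success, -1 if the heap allocation fails. */
int validate_and_map(unsigned int ngpus)
{
    struct validate_ctx *ctx;
    unsigned int i;

    assert(ngpus <= MAX_GPU_INSTANCES);

    /* Heap allocation keeps the function's own stack frame small. */
    ctx = calloc(1, sizeof(*ctx));
    if (!ctx)
        return -1;

    for (i = 0; i < ngpus; i++)
        ctx->bitmap[i / 8] |= (uint8_t)(1u << (i % 8));

    free(ctx);
    return 0;
}
```

The trade-off is an extra failure path: the caller now has to handle an allocation error that a stack variable could never produce.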
2023-06-09 | drm/amdgpu/gmc9: fix 64 bit division in partition code | Alex Deucher
Rework logic or use do_div() to avoid problems on 32 bit. v2: add a missing case for XCP macro v3: fix out of bounds array access v4: fix xcp handling harder Acked-by: Guchun Chen <guchun.chen@amd.com> (v1) Reviewed-by: Mukul Joshi <mukul.joshi@amd.com> (v3) Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
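For context, the kernel's `do_div(n, base)` divides a 64-bit value by a 32-bit one in place and returns the remainder, which avoids relying on compiler helper routines that are unavailable on 32-bit kernel builds. The following user-space sketch shows the same contract using only shifts and subtraction; `sketch_div_u64_rem()` is a hypothetical stand-in, not the kernel implementation.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Divide a 64-bit dividend by a 32-bit divisor with binary long division.
 * Returns the quotient and stores the remainder in *rem.
 */
uint64_t sketch_div_u64_rem(uint64_t dividend, uint32_t divisor, uint32_t *rem)
{
    uint64_t quot = 0;
    uint64_t r = 0;
    int bit;

    assert(divisor != 0 && rem != 0);

    for (bit = 63; bit >= 0; bit--) {
        /* Bring down the next bit of the dividend. */
        r = (r << 1) | ((dividend >> bit) & 1);
        if (r >= divisor) {
            r -= divisor;
            quot |= (uint64_t)1 << bit;
        }
    }
    *rem = (uint32_t)r;
    return quot;
}
```

When the divisor is known to be a power of two, reworking the logic into a shift is usually simpler than any division helper.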
2023-06-09 | drm/amdkfd: APU mode set max svm range pages | Philip Yang
svm_migrate_init sets the max svm range pages based on the KFD nodes partition size. APU mode doesn't init pgmap because there is no migration. kgd2kfd_device_init calls svm_migrate_init after KFD nodes allocation and initialization. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 | drm/amdkfd: Fix memory reporting on GFX 9.4.3 | Mukul Joshi
This patch fixes memory reporting on the GFX 9.4.3 APU and dGPU by reporting available memory on a per partition basis. If it's an APU, available and used memory calculations take into account system and TTM memory. v2: squash in fix ("drm/amdkfd: Fix array out of bound warning") squash in fix ("drm/amdgpu: Update memory reporting for GFX9.4.3") Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 | drm/amdkfd: Store xcp partition id to amdgpu bo | Philip Yang
For memory accounting per compute partition and export drm amdgpu bo and then import to KFD, we need the xcp id to account the memory usage or find the KFD node of the original amdgpu bo to create the KFD bo on the correct adev KFD node. Set xcp_id_plus1 of amdgpu_bo_param to create bo and store xcp_id to amdgpu bo. Add helper macro to get the mem_id from adev and xcp_id. v2: squash in fix ("drm/amdgpu: Fix BO creation failure on GFX 9.4.3 dGPU") Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
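The `_plus1` suffix suggests a common C idiom: store the id offset by one so that a zero-initialized parameter struct means "no partition specified". The sketch below assumes that reading; the struct, the helpers, and the `-1` sentinel are hypothetical and are not taken from the driver.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_XCP_ID_NONE (-1) /* hypothetical sentinel for "not set" */

/* Hypothetical parameter struct; zero-initialization means "no partition". */
struct sketch_bo_param {
    uint8_t xcp_id_plus1;
};

/* Store a partition id using the plus-one encoding. */
void sketch_set_xcp_id(struct sketch_bo_param *p, int xcp_id)
{
    assert(p != 0 && xcp_id >= 0 && xcp_id < UINT8_MAX);
    p->xcp_id_plus1 = (uint8_t)(xcp_id + 1);
}

/* Decode the partition id, or SKETCH_XCP_ID_NONE if none was set. */
int sketch_get_xcp_id(const struct sketch_bo_param *p)
{
    assert(p != 0);
    return (int)p->xcp_id_plus1 - 1;
}
```

The benefit of this encoding is that existing callers that zero their parameter struct keep their old behavior without any code change.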
2023-06-09 | drm/amdkfd: Update MTYPE for far memory partition | Philip Yang
Use MTYPE_RW/MTYPE_CC for mapping system memory or VRAM to KFD node within the same memory partition, use MTYPE_NC for mapping on KFD node from the far memory partition of the same socket or from another socket on same XGMI hive. On NPS4 or 4P system, MTYPE will be overridden per page depending on the memory NUMA node id and vm->mem_id. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
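The selection rule described here can be sketched as a small decision function. This is a simplified model, not the driver logic: the enum, the function, and both boolean inputs are hypothetical, and the PCIe case is borrowed from the "Set MTYPE_UC for access over PCIe" entry in this same log.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; the real MTYPE values are hardware encodings. */
enum sketch_mtype {
    SKETCH_MTYPE_RW,
    SKETCH_MTYPE_NC,
    SKETCH_MTYPE_CC,
    SKETCH_MTYPE_UC
};

/*
 * same_partition: memory belongs to the mapping node's own memory partition.
 * over_pcie:      memory is only reachable over PCIe (not the same device
 *                 and not the same XGMI hive).
 * mtype_local:    configured type for local memory (RW or CC in the text).
 */
enum sketch_mtype sketch_pick_mtype(bool same_partition, bool over_pcie,
                                    enum sketch_mtype mtype_local)
{
    /* UC is never a valid local choice in this sketch. */
    assert(mtype_local != SKETCH_MTYPE_UC);

    if (over_pcie)
        return SKETCH_MTYPE_UC;  /* access over PCIe is uncached */
    if (same_partition)
        return mtype_local;      /* local partition: RW or CC */
    return SKETCH_MTYPE_NC;      /* far partition or other socket in the hive */
}
```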
2023-06-09 | drm/amdkfd: SVM range allocation support memory partition | Philip Yang
Pass kfd node->xcp->mem_id to amdgpu bo create parameter mem_id_plus1 to allocate new svm_bo on the specified memory partition. This is only for dGPU mode as we don't migrate with APU mode. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 | drm/amdgpu/bu: update mtype_local parameter settings | Graham Sider
Update mtype_local module parameter to use MTYPE_RW by default. 0: MTYPE_RW (default) 1: MTYPE_NC 2: MTYPE_CC Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 | drm/amdgpu/bu: add mtype_local as a module parameter | David Francis
Selects the MTYPE to be used for local memory, (0 = MTYPE_CC (default), 1 = MTYPE_NC, 2 = MTYPE_RW) v2: squash in build fix (Alex) Reviewed-by: Graham Sider <Graham.Sider@amd.com> Signed-off-by: David Francis <David.Francis@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 | drm/amdgpu: Fix per-BO MTYPE selection for GFXv9.4.3 | Felix Kuehling
Treat system memory on NUMA systems as remote by default. Overriding with a more efficient MTYPE per page will be implemented in the next patch. No need for a special case for APP APUs. System memory is handled the same for carve-out and native mode. And VRAM doesn't exist in native mode. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Reviewed-and-tested-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 | drm/amdgpu/bu: Add use_mtype_cc_wa module param | Graham Sider
By default, set use_mtype_cc_wa to 1 to set PTE coherence flag MTYPE_CC instead of MTYPE_RW by default. This is required for the time being to mitigate a bug causing XCCs to hit stale data due to TCC marking fully dirty lines as exclusive. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Joseph Greathouse <Joseph.Greathouse@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 | drm/amdkfd: Move pgmap to amdgpu_kfd_dev structure | Philip Yang
VRAM pgmap resource is allocated every time when switching compute partitions because kfd_dev is re-initialized by post_partition_switch. As a result, the memory region resource leaks and system memory usage accounting becomes unbalanced. The pgmap resource should be allocated and registered only once when loading the driver and freed when unloading the driver, so move it from kfd_dev to amdgpu_kfd_dev. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 | drm/amdkfd: Enable SVM on Native mode | Mukul Joshi
This patch enables SVM capability on GFX9.4.3 when run in Native mode. It also sets best_prefetch and best_restore locations to CPU as there is no VRAM. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Acked-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 | drm/amdgpu: Correct dGPU MTYPE settings for gfx943 | Graham Sider
Revert temporary dGPU VRAM MTYPE setting and align with expected coherency protocol. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 | amd/amdgpu: Set MTYPE_UC for access over PCIe | Amber Lin
For GFX v9_4_3, set MTYPE_UC for memory access over PCIe. v4 - add missing indentation pointed out by Felix and add his reviewed-by tag. v3 - add missing logic for the svm path. v2 - add amdgpu_xgmi_same_hive to separate access over xgmi from pcie Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Amber Lin <Amber.Lin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 | drm/amdkfd: Update interrupt handling for GFX9.4.3 | Mukul Joshi
Update interrupt handling in CPX mode for GFX9.4.3 by using the VMID space instead of SDMA client id to determine if an interrupt should be processed by a KFD node. This is especially needed for handling retry faults from MMHUB. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 | drm/amdkfd: Update SMI events for GFX9.4.3 | Mukul Joshi
On GFX 9.4.3, there can be multiple KFD nodes. As a result, SMI events for SVM, queue evict/restore should be raised for each node independently. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 | drm/amdkfd: pass kfd_node ref to svm migration api | Alex Sierra
This work is a prerequisite for supporting memory partitions per node in SVM on GC 9.4.3. When multiple partitions are configured, every BO should be allocated inside one specific partition which corresponds to the current amdgpu_device and kfd_node. v2: squash in compilation fix (Alex) v3: squash in fix for pre-gfx 9.4.3 (Alex) v4: squash in best_loc fix (Alex) Signed-off-by: Alex Sierra <alex.sierra@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 | drm/amdkfd: Update coherence settings for svm ranges | Rajneesh Bhardwaj
Recently introduced commit "drm/amdgpu: Set cache coherency for GC 9.4.3" did not update the settings applicable for svm ranges. Add the coherence settings for svm ranges for GFX IP 9.4.3. Reviewed-by: Amber Lin <amber.lin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 | drm/amdkfd: Introduce kfd_node struct (v5) | Mukul Joshi
Introduce a new structure, kfd_node, which will now represent a compute node. kfd_node is carved out of kfd_dev structure. kfd_dev struct now will become the parent of kfd_node, and will store common resources such as doorbells, GTT sub-allocator etc. kfd_node struct will store all resources specific to a compute node, such as device queue manager, interrupt handling etc. This is the first step in adding compute partition support in KFD. v2: introduce kfd_node struct to gc v11 (Hawking) v3: make reference to kfd_dev struct through kfd_node (Morris) v4: use kfd_node instead for kfd isr/mqd functions (Morris) v5: rebase (Alex) Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Tested-by: Amber Lin <Amber.Lin@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Morris Zhang <Shiwu.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 | drm/amdkfd: Optimize svm range map to GPU with XNACK on | Philip Yang
With XNACK on, if svm_range_set_attr sets the range access or access_in_place attribute, we don't call svm_range_validate_and_map to update GPU mapping. This avoids prefaulting the range pages on system memory if the range is not prefetched to VRAM and not mapped to GPUs. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-04-13 | drm/amdgpu: Enable IH retry CAM on GFX9 | Mukul Joshi
This patch enables the IH retry CAM on GFX9 series cards. This retry filter is used to prevent sending lots of retry interrupts in a short span of time and overflowing the IH ring buffer. This will also help reduce CPU interrupt workload. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-16 | Merge tag 'amd-drm-next-6.3-2023-01-13' of https://gitlab.freedesktop.org/agd5f/linux into drm-next | Dave Airlie
amd-drm-next-6.3-2023-01-13:
amdgpu:
- Fix possible segfault in failure case
- Rework FW requests to happen in early_init for all IPs so that we don't lose the sbios console if FW is missing
- PSR fixes
- Misc cleanups
- Unload fix
- SMU13 fixes
amdkfd:
- Fix for cleared VRAM BOs
- Fix cleanup if GPUVM creation fails
- Memory accounting fix
- Use resource_size rather than open coding it
- GC11 mGPU fix
radeon:
- Fix memory leak on shutdown
Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230113225911.7776-1-alexander.deucher@amd.com
2023-01-10 | drm/amdkfd: Add sync after creating vram bo | Eric Huang
There will be data corruption on vram allocated by svm if the initialization is not complete and the application is writing to the memory. Add a sync to wait for the initialization to complete, which resolves this issue. Signed-off-by: Eric Huang <jinhuieric.huang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-12-06 | drm/ttm: merge ttm_bo_api.h and ttm_bo_driver.h v2 | Christian König
Merge and cleanup the two headers into a single description of the object API. Also move all the documentation to the implementation and drop unnecessary includes from the header. No functional change. v2: minimal checkpatch.pl cleanup Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221125102137.1801-4-christian.koenig@amd.com
2022-11-17 | drm/amdgpu: cleanup amdgpu_hmm_range_get_pages | Christian König
Remove unused parameters and cleanup dead code. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-17 | drm/amdgpu: rename the files for HMM handling | Christian König
Clean that up a bit, no functional change. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-27 | drm/amdkfd: Cleanup kfd_dev struct | Mukul Joshi
Cleanup kfd_dev struct by removing ddev and pdev as both drm_device and pci_dev can be fetched from amdgpu_device. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Tested-by: Amber Lin <Amber.Lin@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-24 | drm/amdkfd: use vma_lookup() instead of find_vma() | Deming Wang
Using vma_lookup() verifies the start address is contained in the found vma. This makes the code easier to read. Signed-off-by: Deming Wang <wangdeming@inspur.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
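The difference between the two lookups is easy to miss: `find_vma()` returns the first area whose end lies above the address, which may start after the address, while `vma_lookup()` additionally requires the address to fall inside the area. The sketch below models both over a sorted array; `struct sketch_vma` and both functions are hypothetical stand-ins for the kernel's VMA tree and helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical area covering [vm_start, vm_end). */
struct sketch_vma {
    unsigned long vm_start;
    unsigned long vm_end;
};

/* find_vma() semantics: first area with vm_end > addr; may not contain addr. */
const struct sketch_vma *sketch_find_vma(const struct sketch_vma *v, size_t n,
                                         unsigned long addr)
{
    size_t i;

    assert(v != NULL || n == 0);
    for (i = 0; i < n; i++)     /* areas are assumed sorted by address */
        if (v[i].vm_end > addr)
            return &v[i];
    return NULL;
}

/* vma_lookup() semantics: only return the area if it contains addr. */
const struct sketch_vma *sketch_vma_lookup(const struct sketch_vma *v, size_t n,
                                           unsigned long addr)
{
    const struct sketch_vma *vma = sketch_find_vma(v, n, addr);

    if (vma && vma->vm_start > addr)
        vma = NULL;
    return vma;
}
```

With `find_vma()` a caller has to remember the extra `vm_start` check; folding it into the lookup removes that source of bugs.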
2022-10-14 | Merge tag 'mm-stable-2022-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm | Linus Torvalds
Pull more MM updates from Andrew Morton:
- fix a race which causes page refcounting errors in ZONE_DEVICE pages (Alistair Popple)
- fix userfaultfd test harness instability (Peter Xu)
- various other patches in MM, mainly fixes
* tag 'mm-stable-2022-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (29 commits)
  highmem: fix kmap_to_page() for kmap_local_page() addresses
  mm/page_alloc: fix incorrect PGFREE and PGALLOC for high-order page
  mm/selftest: uffd: explain the write missing fault check
  mm/hugetlb: use hugetlb_pte_stable in migration race check
  mm/hugetlb: fix race condition of uffd missing/minor handling
  zram: always expose rw_page
  LoongArch: update local TLB if PTE entry exists
  mm: use update_mmu_tlb() on the second thread
  kasan: fix array-bounds warnings in tests
  hmm-tests: add test for migrate_device_range()
  nouveau/dmem: evict device private memory during release
  nouveau/dmem: refactor nouveau_dmem_fault_copy_one()
  mm/migrate_device.c: add migrate_device_range()
  mm/migrate_device.c: refactor migrate_vma and migrate_deivce_coherent_page()
  mm/memremap.c: take a pgmap reference on page allocation
  mm: free device private pages have zero refcount
  mm/memory.c: fix race when faulting a device private page
  mm/damon: use damon_sz_region() in appropriate place
  mm/damon: move sz_damon_region to damon_sz_region
  lib/test_meminit: add checks for the allocation functions
  ...
2022-10-12 | mm/memory.c: fix race when faulting a device private page | Alistair Popple
Patch series "Fix several device private page reference counting issues", v2

This series aims to fix a number of page reference counting issues in drivers dealing with device private ZONE_DEVICE pages. These result in use-after-free type bugs, either from accessing a struct page which no longer exists because it has been removed or accessing fields within the struct page which are no longer valid because the page has been freed.

During normal usage it is unlikely these will cause any problems. However without these fixes it is possible to crash the kernel from userspace. These crashes can be triggered either by unloading the kernel module or unbinding the device from the driver prior to a userspace task exiting. In modules such as Nouveau it is also possible to trigger some of these issues by explicitly closing the device file-descriptor prior to the task exiting and then accessing device private memory.

This involves some minor changes to both PowerPC and AMD GPU code. Unfortunately I lack hardware to test either of those so any help there would be appreciated. The changes mimic what is done for both Nouveau and hmm-tests though so I doubt they will cause problems.

This patch (of 8):

When the CPU tries to access a device private page the migrate_to_ram() callback associated with the pgmap for the page is called. However no reference is taken on the faulting page. Therefore a concurrent migration of the device private page can free the page and possibly the underlying pgmap. This results in a race which can crash the kernel due to the migrate_to_ram() function pointer becoming invalid. It also means drivers can't reliably read the zone_device_data field because the page may have been freed with memunmap_pages().

Close the race by getting a reference on the page while holding the ptl to ensure it has not been freed. Unfortunately the elevated reference count will cause the migration required to handle the fault to fail. To avoid this failure pass the faulting page into the migrate_vma functions so that if an elevated reference count is found it can be checked to see if it's expected or not.

[mpe@ellerman.id.au: fix build] Link: https://lkml.kernel.org/r/87fsgbf3gh.fsf@mpe.ellerman.id.au Link: https://lkml.kernel.org/r/cover.60659b549d8509ddecafad4f498ee7f03bb23c69.1664366292.git-series.apopple@nvidia.com Link: https://lkml.kernel.org/r/d3e813178a59e565e8d78d9b9a4e2562f6494f90.1664366292.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Lyude Paul <lyude@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Alex Sierra <alex.sierra@amd.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Christian König <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-30 | drm/amdkfd: Track unified memory when switching xnack mode | Philip Yang
Unified memory usage with xnack off is tracked to avoid oversubscribing system memory; with xnack on, we don't track unified memory usage, to allow memory oversubscription. When switching xnack mode from off to on, subsequent free ranges allocated with xnack off will not unreserve memory. When switching xnack mode from on to off, subsequent free ranges allocated with xnack on will unreserve memory. Both cases leave memory accounting unbalanced. When switching xnack mode from on to off, we need to reserve already allocated svm range memory. When switching xnack mode from off to on, we need to unreserve already allocated svm range memory. v6: Take prange lock to access range child list v5: Handle prange child ranges v4: Handle reservation memory failure v3: Handle switching xnack mode race with svm_range_deferred_list_work v2: Handle both switching xnack from on to off and from off to on cases Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
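The accounting invariant can be sketched with a toy model: memory is reserved only while xnack is off, so a mode switch has to reserve or unreserve everything already allocated. All names below are hypothetical, and the model ignores the locking, child ranges, and reservation failures that the version notes mention.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_RANGES 8 /* hypothetical fixed capacity */

struct sketch_svms {
    bool xnack_on;
    uint64_t reserved;                      /* bytes currently reserved */
    uint64_t range_size[SKETCH_MAX_RANGES]; /* 0 marks an unused slot */
};

/* Create a range; reserve its size only when xnack is off. Returns slot or -1. */
int sketch_range_create(struct sketch_svms *s, uint64_t size)
{
    int i;

    assert(s != NULL && size > 0);
    for (i = 0; i < SKETCH_MAX_RANGES; i++) {
        if (s->range_size[i] == 0) {
            s->range_size[i] = size;
            if (!s->xnack_on)
                s->reserved += size;
            return i;
        }
    }
    return -1;
}

/* Free a range; unreserve only when xnack is off. */
void sketch_range_free(struct sketch_svms *s, int slot)
{
    assert(s != NULL && slot >= 0 && slot < SKETCH_MAX_RANGES);
    if (!s->xnack_on)
        s->reserved -= s->range_size[slot];
    s->range_size[slot] = 0;
}

/* Switch mode and rebalance the reservation for all existing ranges. */
void sketch_set_xnack(struct sketch_svms *s, bool on)
{
    uint64_t total = 0;
    int i;

    assert(s != NULL);
    if (s->xnack_on == on)
        return;
    for (i = 0; i < SKETCH_MAX_RANGES; i++)
        total += s->range_size[i];
    if (on)
        s->reserved -= total; /* off -> on: unreserve existing ranges */
    else
        s->reserved += total; /* on -> off: reserve existing ranges */
    s->xnack_on = on;
}
```

Without the rebalancing step in `sketch_set_xnack()`, a range created in one mode and freed in the other would leave `reserved` permanently too high or too low.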
2022-09-13 | drm/amdkfd: Remove prefault before migrating to VRAM | Philip Yang
Prefaulting potentially allocates system memory pages before a migration. This adds unnecessary overhead. Instead we can skip unallocated pages in the migration and just point migrate->dst to a 0-initialized VRAM page directly. Then the VRAM page will be inserted to the PTE. A subsequent CPU page fault will migrate the page back to system memory. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-16 | drm/amdkfd: Fix mm reference in SVM eviction worker | Felix Kuehling
Use the mm reference from the fence. This allows removing the svm_bo->svms pointer, which was problematic because we cannot assume that the struct kfd_process containing the svms is still allocated without holding a refcount on the process. Use mmget_not_zero to ensure the mm is still valid, and drop the svm_bo reference if it isn't. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
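The "get unless zero" pattern behind `mmget_not_zero()` can be sketched with C11 atomics: take a reference only if the count has not already dropped to zero. `struct sketch_mm` and `sketch_get_not_zero()` are hypothetical user-space stand-ins, not the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object with a user count, standing in for mm->mm_users. */
struct sketch_mm {
    atomic_int users;
};

/* Increment the count unless it is zero; returns true if a reference was taken. */
bool sketch_get_not_zero(struct sketch_mm *mm)
{
    int old;

    assert(mm != NULL);
    old = atomic_load(&mm->users);
    while (old != 0) {
        /* On failure, old is reloaded with the current value and we retry. */
        if (atomic_compare_exchange_weak(&mm->users, &old, old + 1))
            return true;
    }
    return false; /* the object is already being torn down */
}
```

A plain increment would resurrect an object whose last user has already gone; the compare-and-swap loop is what makes a zero count final.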
2022-07-28 | drm/amdkfd: track unified memory reservation with xnack off | Alex Sierra
[WHY] Unified memory with xnack off should be tracked, as userptr mappings and legacy allocations are, to avoid oversubscribing system memory when xnack is off. [HOW] Expose functions reserve_mem_limit and unreserve_mem_limit to the SVM API and call them on every prange creation and free. Signed-off-by: Alex Sierra <alex.sierra@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-07-28 | drm/amdkfd: Split giant svm range | Philip Yang
Giant svm range split to smaller ranges, align the range start address to max svm range pages to improve MMU TLB usage. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
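The splitting arithmetic can be sketched as follows: each piece ends at the next boundary that is a multiple of the maximum range size, so every piece after the first starts aligned. The sketch assumes the maximum is a power of two and works in page units; `struct sketch_piece` and `sketch_split_range()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical inclusive page range [start, last]. */
struct sketch_piece {
    uint64_t start;
    uint64_t last;
};

/*
 * Split [start, last] into pieces that never cross a max_pages boundary.
 * Writes at most cap pieces to out and returns the number written.
 */
size_t sketch_split_range(uint64_t start, uint64_t last, uint64_t max_pages,
                          struct sketch_piece *out, size_t cap)
{
    size_t n = 0;

    /* Assumption: max_pages is a non-zero power of two. */
    assert(out != NULL);
    assert(max_pages != 0 && (max_pages & (max_pages - 1)) == 0);

    while (start <= last && n < cap) {
        /* Last page before the next max_pages-aligned boundary. */
        uint64_t end = start | (max_pages - 1);

        if (end > last)
            end = last;
        out[n].start = start;
        out[n].last = end;
        n++;
        if (end == UINT64_MAX)
            break; /* avoid wrapping start around to 0 */
        start = end + 1;
    }
    return n;
}
```

For example, with `max_pages = 512`, the range of pages 100 to 1500 becomes 100 to 511, 512 to 1023, and 1024 to 1500.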
2022-07-28 | drm/amdkfd: Set svm range max pages | Philip Yang
This will be used to split giant svm range into smaller ranges, to support VRAM overcommitment by giant range and improve GPU retry fault recover on giant range. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-07-07 | drm/amdkfd: optimize svm range evict | Eric Huang
It is to avoid unnecessary queue eviction when range is not mapped to gpu. Signed-off-by: Eric Huang <jinhuieric.huang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-07-07 | drm/amdkfd: change svm range evict | Eric Huang
Always evict queues when the flag KFD_IOCTL_SVM_FLAG_GPU_ALWAYS_MAPPED is set, as if XNACK were off. Signed-off-by: Eric Huang <jinhuieric.huang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-30 | drm/amdkfd: Add unmap from GPU SMI event | Philip Yang
SVM range unmapped from GPUs when range is unmapped from CPU, or with xnack on from MMU notifier when range is evicted or migrated. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-30 | drm/amdkfd: Add user queue eviction restore SMI event | Philip Yang
Output user queue eviction and restore event. User queue eviction may be triggered by svm or userptr MMU notifier, TTM eviction, device suspend and CRIU checkpoint and restore. User queue restore may be rescheduled if eviction happens again while restore. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-30 | drm/amdkfd: Add migration SMI event | Philip Yang
For migration start and end events, output the timestamp when migration starts and ends, the svm range address and size, the GPU id of the migration source and destination, and the svm range attributes. The migration trigger could be prefetch, CPU or GPU page fault, or TTM eviction. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-30 | drm/amdkfd: Add GPU recoverable fault SMI event | Philip Yang
Use ktime_get_boottime_ns() as timestamp to correlate with other APIs. Output timestamp when GPU recoverable fault starts and ends to recover the fault, if migration happened or only GPU page table is updated to recover, fault address, if read or write fault. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-03 | drm/amdkfd: Fix partial migration bugs | Philip Yang
When migrating a range from system memory to VRAM, if a system page cannot be locked or unmapped, we do a partial migration and leave some pages in system memory. Several bugs were found in copying pages and updating the GPU mapping in this situation: 1. Copy to VRAM should use migrate->npages, the total number of pages in the range, as npages, not migrate->cpages, the number of pages that can be migrated. 2. After a partial copy, set the VRAM res cursor to j + 1, where j is the number of system pages copied, plus 1 page to skip the copy. 3. Copy to RAM should collect all contiguous VRAM pages and copy them together. 4. The call to amdgpu_vm_update_range should pass the offset in bytes, not as a number of pages. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
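Item 3, copying contiguous pages together, amounts to grouping consecutive page frame numbers into runs so that each run can be handled by one copy. The sketch below shows only that grouping step; the names are hypothetical and no actual copy is performed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical run of physically consecutive pages. */
struct sketch_run {
    uint64_t first_pfn;
    size_t npages;
};

/*
 * Group pfn[0..npages) into runs of consecutive frame numbers.
 * Writes at most cap runs and returns the number written.
 */
size_t sketch_coalesce_runs(const uint64_t *pfn, size_t npages,
                            struct sketch_run *runs, size_t cap)
{
    size_t i = 0;
    size_t n = 0;

    assert((pfn != NULL || npages == 0) && runs != NULL);

    while (i < npages && n < cap) {
        size_t j = i + 1;

        /* Extend the run while each page directly follows the previous one. */
        while (j < npages && pfn[j] == pfn[j - 1] + 1)
            j++;
        runs[n].first_pfn = pfn[i];
        runs[n].npages = j - i;
        n++;
        i = j;
    }
    return n;
}
```

A run's byte length is its page count shifted left by the page shift, which is also the conversion item 4 refers to.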
2022-06-01 | drm/amdkfd: Use mmget_not_zero in MMU notifier | Philip Yang
MMU notifier callback may pass in mm with mm->mm_users==0 when the process is exiting; use mmget_not_zero to avoid accessing an invalid mm in deferred list work after mm is gone. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-26 | drm/amdgpu: add AMDGPU_GEM_CREATE_DISCARDABLE | Christian König
Add an AMDGPU_GEM_CREATE_DISCARDABLE flag to note that the content of a BO doesn't need to be preserved during eviction. KFD was already using a similar functionality for SVM BOs so replace the internal flag with the new UAPI. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Marek Olšák <marek.olsak@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-06 | Merge tag 'amd-drm-next-5.19-2022-04-29' of https://gitlab.freedesktop.org/agd5f/linux into drm-next | Dave Airlie
amd-drm-next-5.19-2022-04-29:
amdgpu
- RAS updates
- SI dpm deadlock fix
- Misc code cleanups
- HDCP fixes
- PSR fixes
- DSC fixes
- SDMA doorbell cleanups
- S0ix fix
- DC FP fix
- Zen dom0 regression fix for APUs
- IP discovery updates
- Initial SoC21 support
- Support for new vbios tables
- Runtime PM fixes
- Add PSP TA debugfs interface
amdkfd:
- Misc code cleanups
- Ignore bogus MEC signals more efficiently
- SVM fixes
- Use bitmap helpers
radeon:
- Misc code cleanups
- Spelling/grammar fixes
From: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220429144853.5742-1-alexander.deucher@amd.com