summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdkfd/kfd_process.c
AgeCommit message (Collapse)Author
2020-10-15Merge tag 'drm-next-2020-10-15' of git://anongit.freedesktop.org/drm/drmLinus Torvalds
Pull drm updates from Dave Airlie: "Not a major amount of change, the i915 trees got split into display and gt trees to better facilitate higher level review, and there's a major refactoring of i915 GEM locking to use more core kernel concepts (like ww-mutexes). msm gets per-process pagetables, older AMD SI cards get DC support, nouveau got a bump in displayport support with common code extraction from i915. Outside of drm this contains a couple of patches for hexint moduleparams which you've acked, and a virtio common code tree that you should also get via it's regular path. New driver: - Cadence MHDP8546 DisplayPort bridge driver core: - cross-driver scatterlist cleanups - devm_drm conversions - remove drm_dev_init - devm_drm_dev_alloc conversion ttm: - lots of refactoring and cleanups bridges: - chained bridge support in more drivers panel: - misc new panels scheduler: - cleanup priority levels displayport: - refactor i915 code into helpers for nouveau i915: - split into display and GT trees - WW locking refactoring in GEM - execbuf2 extension mechanism - syncobj timeline support - GEN 12 HOBL display powersaving - Rocket Lake display additions - Disable FBC on Tigerlake - Tigerlake Type-C + DP improvements - Hotplug interrupt refactoring amdgpu: - Sienna Cichlid updates - Navy Flounder updates - DCE6 (SI) support for DC - Plane rotation enabled - TMZ state info ioctl - PCIe DPC recovery support - DC interrupt handling refactor - OLED panel fixes amdkfd: - add SMI events for thermal throttling - SMI interface events ioctl update - process eviction counters radeon: - move to dma_ for allocations - expose sclk via sysfs msm: - DSI support for sm8150/sm8250 - per-process GPU pagetable support - Displayport support mediatek: - move HDMI phy driver to PHY - convert mtk-dpi to bridge API - disable mt2701 tmds tegra: - bridge support exynos: - misc cleanups vc4: - dual display cleanups ast: - cleanups gma500: - conversion to GPIOd API hisilicon: - misc reworks ingenic: - clock handling and format improvements mcde: - DSI support mgag200: - desktop g200 support mxsfb: - i.MX7 + i.MX8M - alpha plane support panfrost: - devfreq support - amlogic SoC support ps8640: - EDID from eDP retrieval tidss: - AM65xx YUV workaround virtio: - virtio-gpu exported resources rcar-du: - R8A7742, R8A774E1 and R8A77961 support - YUV planar format fixes - non-visible plane handling - VSP device reference count fix - Kconfig fix to avoid displaying disabled options in .config" * tag 'drm-next-2020-10-15' of git://anongit.freedesktop.org/drm/drm: (1494 commits) drm/ingenic: Fix bad revert drm/amdgpu: Fix invalid number of character '{' in amdgpu_acpi_init drm/amdgpu: Remove warning for virtual_display drm/amdgpu: kfd_initialized can be static drm/amd/pm: setup APU dpm clock table in SMU HW initialization drm/amdgpu: prevent spurious warning drm/amdgpu/swsmu: fix ARC build errors drm/amd/display: Fix OPTC_DATA_FORMAT programming drm/amd/display: Don't allow pstate if no support in blank drm/panfrost: increase readl_relaxed_poll_timeout values MAINTAINERS: Update entry for st7703 driver after the rename Revert "gpu/drm: ingenic: Add option to mmap GEM buffers cached" drm/amd/display: HDMI remote sink need mode validation for Linux drm/amd/display: Change to correct unit on audio rate drm/amd/display: Avoid set zero in the requested clk drm/amdgpu: align frag_end to covered address space drm/amdgpu: fix NULL pointer dereference for Renoir drm/vmwgfx: fix regression in thp code due to ttm init refactor. drm/amdgpu/swsmu: add interrupt work handler for smu11 parts drm/amdgpu/swsmu: add interrupt work function ...
2020-09-30drm/amd/amdkfd: Surface files in Sysfs to allow users to get number ofRamesh Errabolu
compute units that are in use. [Why] Allow user to know how many compute units (CU) are in use at any given moment. [How] Surface files in Sysfs that allow user to determine the number of compute units that are in use for a given process. One Sysfs file is used per device. Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com> Reviewed-By: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-09-22drm/amdkfd: Fix kfd init stack dumpPhilip Cox
amdkfd is dumping a stack during initialization. kfd_procfs_add_sysfs_stats is being called twice. This removes one of them. Fixes: 4327bed2ff8e3d ("drm/amdkfd: Add process eviction counters to sysfs") Reviewed-by: Kent Russell <kent.russell@amd.com> Signed-off-by: Philip Cox <Philip.Cox@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-09-22drm/amdkfd: Move process doorbell allocation into kfd deviceMukul Joshi
Move doorbell allocation for a process into kfd device and allocate doorbell space in each PDD during process creation. Currently, KFD manages its own doorbell space but for some devices, amdgpu would allocate the complete doorbell space instead of leaving a chunk of doorbell space for KFD to manage. In a system with mix of such devices, KFD would need to request process doorbell space based on the type of device, either from amdgpu or from its own doorbell space. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-09-17drm/amdkfd: Add process eviction counters to sysfsPhilip Cox
Add per-process eviction counters to sysfs to keep track of how many eviction events have happened for each process. v2: rename the stats dir, and track all evictions per process, per device. v3: Simplify the stats kobject handling and cleanup. v4: more code cleanup Signed-off-by: Philip Cox <Philip.Cox@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-09-17drm/amdkfd: Add some eveiction debugging codePhilip Cox
Extending the module parameter debug_evictions to also print a stack trace when the eviction code path is called. Signed-off-by: Philip Cox <Philip.Cox@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-09-17drm, iommu: Change type of pasid to u32Fenghua Yu
PASID is defined as a few different types in iommu including "int", "u32", and "unsigned int". To be consistent and to match with uapi definitions, define PASID and its variations (e.g. max PASID) as "u32". "u32" is also shorter and a little more explicit than "unsigned int". No PASID type change in uapi although it defines PASID as __u64 in some places. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Joerg Roedel <jroedel@suse.de> Link: https://lkml.kernel.org/r/1600187413-163670-2-git-send-email-fenghua.yu@intel.com
2020-08-24drm/amdkfd: sparse: Fix warning in reading SDMA countersMukul Joshi
Add __user annotation to fix related sparse warning while reading SDMA counters from userland. Also, rework the read SDMA counters function by removing redundant checks. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-18drm/amdkfd: Initialize SDMA activity counter to 0Mukul Joshi
To prevent reporting erroneous SDMA usage, initialize SDMA activity counter to 0 before using. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-14drm/amdgpu: revert "fix system hang issue during GPU reset"Christian König
The whole approach wasn't thought through till the end. We already had a reset lock like this in the past and it caused the same problems like this one. Completely revert the patch for now and add individual trylock protection to the hardware access functions as necessary. This reverts commit df9c8d1aa278c435c30a69b8f2418b4a52fcb929. Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-27drm/amdgpu: fix system hang issue during GPU resetDennis Li
when GPU hang, driver has multi-paths to enter amdgpu_device_gpu_recover, the atomic adev->in_gpu_reset and hive->in_reset are used to avoid re-entering GPU recovery. During GPU reset and resume, it is unsafe that other threads access GPU, which maybe cause GPU reset failed. Therefore the new rw_semaphore adev->reset_sem is introduced, which protect GPU from being accessed by external threads during recovery. v2: 1. add rwlock for some ioctls, debugfs and file-close function. 2. change to use dqm->is_resetting and dqm_lock for protection in kfd driver. 3. remove try_lock and change adev->in_gpu_reset as atomic, to avoid re-enter GPU recovery for the same GPU hang. v3: 1. change back to use adev->reset_sem to protect kfd callback functions, because dqm_lock couldn't protect all codes, for example: free_mqd must be called outside of dqm_lock; [ 1230.176199] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019 [ 1230.177221] Call Trace: [ 1230.178249] dump_stack+0x98/0xd5 [ 1230.179443] amdgpu_virt_kiq_reg_write_reg_wait+0x181/0x190 [amdgpu] [ 1230.180673] gmc_v9_0_flush_gpu_tlb+0xcc/0x310 [amdgpu] [ 1230.181882] amdgpu_gart_unbind+0xa9/0xe0 [amdgpu] [ 1230.183098] amdgpu_ttm_backend_unbind+0x46/0x180 [amdgpu] [ 1230.184239] ? ttm_bo_put+0x171/0x5f0 [ttm] [ 1230.185394] ttm_tt_unbind+0x21/0x40 [ttm] [ 1230.186558] ttm_tt_destroy.part.12+0x12/0x60 [ttm] [ 1230.187707] ttm_tt_destroy+0x13/0x20 [ttm] [ 1230.188832] ttm_bo_cleanup_memtype_use+0x36/0x80 [ttm] [ 1230.189979] ttm_bo_put+0x1be/0x5f0 [ttm] [ 1230.191230] amdgpu_bo_unref+0x1e/0x30 [amdgpu] [ 1230.192522] amdgpu_amdkfd_free_gtt_mem+0xaf/0x140 [amdgpu] [ 1230.193833] free_mqd+0x25/0x40 [amdgpu] [ 1230.195143] destroy_queue_cpsch+0x1a7/0x270 [amdgpu] [ 1230.196475] pqm_destroy_queue+0x105/0x260 [amdgpu] [ 1230.197819] kfd_ioctl_destroy_queue+0x37/0x70 [amdgpu] [ 1230.199154] kfd_ioctl+0x277/0x500 [amdgpu] [ 1230.200458] ? kfd_ioctl_get_clock_counters+0x60/0x60 [amdgpu] [ 1230.201656] ? tomoyo_file_ioctl+0x19/0x20 [ 1230.202831] ksys_ioctl+0x98/0xb0 [ 1230.204004] __x64_sys_ioctl+0x1a/0x20 [ 1230.205174] do_syscall_64+0x5f/0x250 [ 1230.206339] entry_SYSCALL_64_after_hwframe+0x49/0xbe 2. remove try_lock and introduce atomic hive->in_reset, to avoid re-enter GPU recovery. v4: 1. remove an unnecessary whitespace change in kfd_chardev.c 2. remove comment codes in amdgpu_device.c 3. add more detailed comment in commit message 4. define a wrap function amdgpu_in_reset v5: 1. Fix some style issues. Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Suggested-by: Christian König <christian.koenig@amd.com> Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com> Suggested-by: Lijo Lazar <Lijo.Lazar@amd.com> Suggested-by: Luben Tukov <luben.tuikov@amd.com> Signed-off-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-02Merge tag 'amd-drm-next-5.9-2020-07-01' of ↵Dave Airlie
git://people.freedesktop.org/~agd5f/linux into drm-next amd-drm-next-5.9-2020-07-01: amdgpu: - DC DMUB updates - HDCP fixes - Thermal interrupt fixes - Add initial support for Sienna Cichlid GPU - Add support for unique id on Arcturus - Major swSMU code cleanup - Skip BAR resizing if the bios already did id - Fixes for DCN bandwidth calculations - Runtime PM reference count fixes - Add initial UVD support for SI - Add support for ASSR on eDP links - Lots of misc fixes and cleanups - Enable runtime PM on vega10 boards that support BACO - RAS fixes - SR-IOV fixes - Use IP discovery table on renoir - DC stream synchronization fixes amdkfd: - Track SDMA usage per process - Fix GCC10 compiler warnings - Locking fix radeon: - Default to on chip GART for AGP boards on all arches - Runtime PM reference count fixes UAPI: - Update comments to clarify MTYPE From: Alex Deucher <alexdeucher@gmail.com> Link: https://patchwork.freedesktop.org/patch/msgid/20200701155041.1102829-1-alexander.deucher@amd.com Signed-off-by: Dave Airlie <airlied@redhat.com>
2020-07-01drm/amdkfd: Fix circular locking dependency warningMukul Joshi
[ 150.887733] ====================================================== [ 150.893903] WARNING: possible circular locking dependency detected [ 150.905917] ------------------------------------------------------ [ 150.912129] kfdtest/4081 is trying to acquire lock: [ 150.917002] ffff8f7f3762e118 (&mm->mmap_sem#2){++++}, at: __might_fault+0x3e/0x90 [ 150.924490] but task is already holding lock: [ 150.930320] ffff8f7f49d229e8 (&dqm->lock_hidden){+.+.}, at: destroy_queue_cpsch+0x29/0x210 [amdgpu] [ 150.939432] which lock already depends on the new lock. [ 150.947603] the existing dependency chain (in reverse order) is: [ 150.955074] -> #3 (&dqm->lock_hidden){+.+.}: [ 150.960822] __mutex_lock+0xa1/0x9f0 [ 150.964996] evict_process_queues_cpsch+0x22/0x120 [amdgpu] [ 150.971155] kfd_process_evict_queues+0x3b/0xc0 [amdgpu] [ 150.977054] kgd2kfd_quiesce_mm+0x25/0x60 [amdgpu] [ 150.982442] amdgpu_amdkfd_evict_userptr+0x35/0x70 [amdgpu] [ 150.988615] amdgpu_mn_invalidate_hsa+0x41/0x60 [amdgpu] [ 150.994448] __mmu_notifier_invalidate_range_start+0xa4/0x240 [ 151.000714] copy_page_range+0xd70/0xd80 [ 151.005159] dup_mm+0x3ca/0x550 [ 151.008816] copy_process+0x1bdc/0x1c70 [ 151.013183] _do_fork+0x76/0x6c0 [ 151.016929] __x64_sys_clone+0x8c/0xb0 [ 151.021201] do_syscall_64+0x4a/0x1d0 [ 151.025404] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 151.030977] -> #2 (&adev->notifier_lock){+.+.}: [ 151.036993] __mutex_lock+0xa1/0x9f0 [ 151.041168] amdgpu_mn_invalidate_hsa+0x30/0x60 [amdgpu] [ 151.047019] __mmu_notifier_invalidate_range_start+0xa4/0x240 [ 151.053277] copy_page_range+0xd70/0xd80 [ 151.057722] dup_mm+0x3ca/0x550 [ 151.061388] copy_process+0x1bdc/0x1c70 [ 151.065748] _do_fork+0x76/0x6c0 [ 151.069499] __x64_sys_clone+0x8c/0xb0 [ 151.073765] do_syscall_64+0x4a/0x1d0 [ 151.077952] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 151.083523] -> #1 (mmu_notifier_invalidate_range_start){+.+.}: [ 151.090833] change_protection+0x802/0xab0 [ 151.095448] mprotect_fixup+0x187/0x2d0 [ 151.099801] setup_arg_pages+0x124/0x250 [ 151.104251] load_elf_binary+0x3a4/0x1464 [ 151.108781] search_binary_handler+0x6c/0x210 [ 151.113656] __do_execve_file.isra.40+0x7f7/0xa50 [ 151.118875] do_execve+0x21/0x30 [ 151.122632] call_usermodehelper_exec_async+0x17e/0x190 [ 151.128393] ret_from_fork+0x24/0x30 [ 151.132489] -> #0 (&mm->mmap_sem#2){++++}: [ 151.138064] __lock_acquire+0x11a1/0x1490 [ 151.142597] lock_acquire+0x90/0x180 [ 151.146694] __might_fault+0x68/0x90 [ 151.150879] read_sdma_queue_counter+0x5f/0xb0 [amdgpu] [ 151.156693] update_sdma_queue_past_activity_stats+0x3b/0x90 [amdgpu] [ 151.163725] destroy_queue_cpsch+0x1ae/0x210 [amdgpu] [ 151.169373] pqm_destroy_queue+0xf0/0x250 [amdgpu] [ 151.174762] kfd_ioctl_destroy_queue+0x32/0x70 [amdgpu] [ 151.180577] kfd_ioctl+0x223/0x400 [amdgpu] [ 151.185284] ksys_ioctl+0x8f/0xb0 [ 151.189118] __x64_sys_ioctl+0x16/0x20 [ 151.193389] do_syscall_64+0x4a/0x1d0 [ 151.197569] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 151.203141] other info that might help us debug this: [ 151.211140] Chain exists of: &mm->mmap_sem#2 --> &adev->notifier_lock --> &dqm->lock_hidden [ 151.222535] Possible unsafe locking scenario: [ 151.228447] CPU0 CPU1 [ 151.232971] ---- ---- [ 151.237502] lock(&dqm->lock_hidden); [ 151.241254] lock(&adev->notifier_lock); [ 151.247774] lock(&dqm->lock_hidden); [ 151.254038] lock(&mm->mmap_sem#2); This commit fixes the warning by ensuring get_user() is not called while reading SDMA stats with dqm_lock held as get_user() could cause a page fault which leads to the circular locking scenario. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-01drm/amd: fix potential memleak in err branchBernard Zhao
The function kobject_init_and_add alloc memory like: kobject_init_and_add->kobject_add_varg->kobject_set_name_vargs ->kvasprintf_const->kstrdup_const->kstrdup->kmalloc_track_caller ->kmalloc_slab, in err branch this memory not free. If use kmemleak, this path maybe catched. These changes are to add kobject_put in kobject_init_and_add failed branch, fix potential memleak. Signed-off-by: Bernard Zhao <bernard@vivo.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-01drm/amdkfd: label internally used symbols as staticNirmoy Das
Used sparse(make C=1) to find these loose ends. v2: removed unwanted extra line Signed-off-by: Nirmoy Das <nirmoy.das@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-01drm/amdkfd: fix ref count leak when pm_runtime_get_sync failsAlex Deucher
The call to pm_runtime_get_sync increments the counter even in case of failure, leading to incorrect ref count. In case of failure, decrement the ref count before returning. Reviewed-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-05-29drm/amdkfd: fix a dereference of pdd before it is null checkedColin Ian King
Currently pointer pdd is being dereferenced when assigning pointer dpm and then pdd is being null checked. Fix this by checking if pdd is null before the dereference of pdd occurs. Addresses-Coverity: ("Dereference before null check") Fixes: 32cb59f31362 ("drm/amdkfd: Track SDMA utilization per process") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-05-28drm/amdkfd: Track SDMA utilization per processMukul Joshi
Track SDMA usage on a per process basis and report it through sysfs. The value in the sysfs file indicates the amount of time SDMA has been in-use by this process since the creation of the process. This value is in microsecond granularity. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-05-01drm/amdkfd: Fix comment formattingFelix Kuehling
Corrected two function names. Added a missing space. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-30drm/amdkfd: Track GPU memory utilization per processMukul Joshi
Track GPU VRAM usage on a per process basis and report it through sysfs. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-28drm/amdkfd: Enable over-subscription with >1 GWS queueJoseph Greathouse
The current GWS usage model will only allows a single GWS-enabled process to be active on the GPU at once. This ensures that a barrier-using kernel gets a known amount of GPU hardware, to prevent deadlock due to inability to go beyond the GWS barrier. The HWS watches how many GWS entries are assigned to each process, and goes into over-subscription mode when two processes need more than the 64 that are available. The current KFD method for working with this is to allocate all 64 GWS entries to each GWS-capable process. When more than one GWS-enabled process is in the runlist, we must make sure the runlist is in over-subscription mode, so that the HWS gets a chained RUN_LIST packet and continues scheduling kernels. Signed-off-by: Joseph Greathouse <Joseph.Greathouse@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-03-10drm/amdkfd: Consolidate duplicated bo alloc flagsYong Zhao
ALLOC_MEM_FLAGS_* used are the same as the KFD_IOC_ALLOC_MEM_FLAGS_*, but they are interweavedly used in kernel driver, resulting in bad readability. For example, KFD_IOC_ALLOC_MEM_FLAGS_COHERENT is not referenced in kernel, and it functions implicitly in kernel through ALLOC_MEM_FLAGS_COHERENT, causing unnecessary confusion. Replace all occurrences of ALLOC_MEM_FLAGS_* with KFD_IOC_ALLOC_MEM_FLAGS_* to solve the problem. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-03-06drm/amdkfd: Signal eviction fence on process destruction (v2)Felix Kuehling
Otherwise BOs may wait for the fence indefinitely and never be destroyed. v2: Signal the fence right after destroying queues to avoid unnecessary delaye-delete in kfd_process_wq_release Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: xinhui pan <xinhui.pan@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-12drm/amdkfd: refactor runtime pm for bacoRajneesh Bhardwaj
So far the kfd driver implemented same routines for runtime and system wide suspend and resume (s2idle or mem). During system wide suspend the kfd aquires an atomic lock that prevents any more user processes to create queues and interact with kfd driver and amd gpu. This mechanism created problem when amdgpu device is runtime suspended with BACO enabled. Any application that relies on kfd driver fails to load because the driver reports a locked kfd device since gpu is runtime suspended. However, in an ideal case, when gpu is runtime suspended the kfd driver should be able to: - auto resume amdgpu driver whenever a client requests compute service - prevent runtime suspend for amdgpu while kfd is in use This change refactors the amdgpu and amdkfd drivers to support BACO and runtime power management. Reviewed-by: Oak Zeng <oak.zeng@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-06drm/amdkfd: Add queue information to sysfsAmber Lin
Provide compute queues information in sysfs under /sys/class/kfd/kfd/proc. The format is /sys/class/kfd/kfd/proc/<pid>/queues/<queue id>/XX where XX are size, type, and gpuid three files to represent queue size, queue type, and the GPU this queue uses. <queue id> folder and files underneath are generated when a queue is created. They are removed when the queue is destroyed. Signed-off-by: Amber Lin <Amber.Lin@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-16drm/amdgpu: GPU TLB flush API moved to amdgpu_amdkfdAlex Sierra
[Why] TLB flush method has been deprecated using kfd2kgd interface. This implementation is now on the amdgpu_amdkfd API. [How] TLB flush functions now implemented in amdgpu_amdkfd. Signed-off-by: Alex Sierra <alex.sierra@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-09drm/amdkfd: Improve kfd_process lookup in kfd_ioctlFelix Kuehling
Use filep->private_data to store a pointer to the kfd_process data structure. Take an extra reference for that, which gets released in the kfd_release callback. Check that the process calling kfd_ioctl is the same that opened the file descriptor. Return -EBADF if it's not, so that this error can be distinguished in user mode. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-11-13drm/amdkfd: Simplify the mmap offset related bit operationsYong Zhao
The new code uses straightforward bit shifts and thus has better readability. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-10-26Merge tag 'drm-next-5.5-2019-10-09' of ↵Dave Airlie
git://people.freedesktop.org/~agd5f/linux into drm-next drm-next-5.5-2019-10-09: amdgpu: - Additional RAS enablement for vega20 - RAS page retirement and bad page storage in EEPROM - No GPU reset with unrecoverable RAS errors - Reserve vram for page tables rather than trying to evict - Fix issues with GPU reset and xgmi hives - DC i2c over aux fixes - Direct submission for clears, PTE/PDE updates - Improvements to help support recoverable GPU page faults - Silence harmless SAD block messages - Clean up code for creating a bo at a fixed location - Initial DC HDCP support - Lots of documentation fixes - GPU reset for renoir - Add IH clockgating support for soc15 asics - Powerplay improvements - DC MST cleanups - Add support for MSI-X - Misc cleanups and bug fixes amdkfd: - Query KFD device info by asic type rather than pci ids - Add navi14 support - Add renoir support - Add navi12 support - gfx10 trap handler improvements - pasid cleanups - Check against device cgroup ttm: - Return -EBUSY with pipelining with no_gpu_wait radeon: - Silence harmless SAD block messages device_cgroup: - Export devcgroup_check_permission Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexdeucher@gmail.com> Link: https://patchwork.freedesktop.org/patch/msgid/20191010041713.3412-1-alexander.deucher@amd.com
2019-10-03drm/amdkfd: Use hex print format for pasidYong Zhao
Since KFD pasid starts from 0x8000 (32768 in decimal), it is better perceived as a hex number. Meanwhile, change the pasid type from unsigned int to uint16_t to be consistent throughout the code. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-10-03drm/amdkfd: Remove excessive print when reserving doorbellsYong Zhao
The dozens of printing messages are compressed into 2 lines. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-21Merge tag 'for-linus-hmm' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma Pull hmm updates from Jason Gunthorpe: "This is more cleanup and consolidation of the hmm APIs and the very strongly related mmu_notifier interfaces. Many places across the tree using these interfaces are touched in the process. Beyond that a cleanup to the page walker API and a few memremap related changes round out the series: - General improvement of hmm_range_fault() and related APIs, more documentation, bug fixes from testing, API simplification & consolidation, and unused API removal - Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE, and make them internal kconfig selects - Hoist a lot of code related to mmu notifier attachment out of drivers by using a refcount get/put attachment idiom and remove the convoluted mmu_notifier_unregister_no_release() and related APIs. - General API improvement for the migrate_vma API and revision of its only user in nouveau - Annotate mmu_notifiers with lockdep and sleeping region debugging Two series unrelated to HMM or mmu_notifiers came along due to dependencies: - Allow pagemap's memremap_pages family of APIs to work without providing a struct device - Make walk_page_range() and related use a constant structure for function pointers" * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (75 commits) libnvdimm: Enable unit test infrastructure compile checks mm, notifier: Catch sleeping/blocking for !blockable kernel.h: Add non_block_start/end() drm/radeon: guard against calling an unpaired radeon_mn_unregister() csky: add missing brackets in a macro for tlb.h pagewalk: use lockdep_assert_held for locking validation pagewalk: separate function pointers from iterator data mm: split out a new pagewalk.h header from mm.h mm/mmu_notifiers: annotate with might_sleep() mm/mmu_notifiers: prime lockdep mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end mm/mmu_notifiers: remove the __mmu_notifier_invalidate_range_start/end exports mm/hmm: hmm_range_fault() infinite loop mm/hmm: hmm_range_fault() NULL pointer bug mm/hmm: fix hmm_range_fault()'s handling of swapped out pages mm/mmu_notifiers: remove unregister_no_release RDMA/odp: remove ib_ucontext from ib_umem RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm' RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr RDMA/mlx5: Use ib_umem_start instead of umem.address ...
2019-08-21drm/amdkfd: remove set but not used variable 'pdd'YueHaibing
Fixes gcc '-Wunused-but-set-variable' warning: drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_process.c: In function restore_process_worker: drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_process.c:949:29: warning: variable pdd set but not used [-Wunused-but-set-variable] It is not used since commit 5b87245faf57 ("drm/amdkfd: Simplify kfd2kgd interface") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-08-20drm/amdkfd: use mmu_notifier_putJason Gunthorpe
The sequence of mmu_notifier_unregister_no_release(), mmu_notifier_call_srcu() is identical to mmu_notifier_put() with the free_notifier callback. As this is the last user of those APIs, converting it means we can drop them. Link: https://lore.kernel.org/r/20190806231548.25242-11-jgg@ziepe.ca Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20drm/amdkfd: fix a use after free race with mmu_notifer unregisterJason Gunthorpe
When using mmu_notifer_unregister_no_release() the caller must ensure there is a SRCU synchronize before the mn memory is freed, otherwise use after free races are possible, for instance: CPU0 CPU1 invalidate_range_start hlist_for_each_entry_rcu(..) mmu_notifier_unregister_no_release(&p->mn) kfree(mn) if (mn->ops->invalidate_range_end) The error unwind in amdkfd misses the SRCU synchronization. amdkfd keeps the kfd_process around until the mm is released, so split the flow to fully initialize the kfd_process and register it for find_process, and with the notifier. Past this point the kfd_process does not need to be cleaned up as it is fully ready. The final failable step does a vm_mmap() and does not seem to impact the kfd_process global state. Since it also cannot be undone (and already has problems with undo if it internally fails), it has to be last. This way we don't have to try to unwind the mmu_notifier_register() and avoid the problem with the SRCU. Along the way this also fixes various other error unwind bugs in the flow. Fixes: 45102048f77e ("amdkfd: Add process queue manager module") Link: https://lore.kernel.org/r/20190806231548.25242-10-jgg@ziepe.ca Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-15drm/amdkfd: Fill amdgpu_task_info for KFD VMsYong Zhao
The amdgpu_task_info will be used when printing VM page fault for KFD processes. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Harish Kasiviswanathan <harish.kasiviswanatha@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-06-21drm/amdkfd: Add navi10 support to amdkfd. (v3)Philip Cox
KFD (kernel fusion driver) is the kernel driver for the compute backend for usermode compute stack. v2: squash in updates (Alex) v3: squash in rebase fixes (Alex) Signed-off-by: Oak Zeng <Oak.Zeng@amd.com> Signed-off-by: Philip Cox <Philip.Cox@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-06-20drm/amdkfd: Add procfs-style information for KFD processesKent Russell
Add a folder structure to /sys/class/kfd/kfd/ called proc which contains subfolders, each representing an active KFD process' PID, containing 1 file: pasid. Signed-off-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-02-18drm/amdkfd: Fix bugs regarding CP queue doorbell mask on SOC15Yong Zhao
Reserved doorbells for SDMA IH and VCN were not properly masked out when allocating doorbells for CP user queues. This patch fixed that. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-11-05drm/amdkfd: Simplify kfd2kgd interfaceAmber Lin
After amdkfd module is merged into amdgpu, KFD can call amdgpu directly and no longer needs to use the function pointer. Replace those function pointers with functions if they are not ASIC dependent. Signed-off-by: Amber Lin <Amber.Lin@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-08-29drm/amdkfd: Release an acquired process vmOak Zeng
For compute vm acquired from amdgpu, vm.pasid is managed by kfd. Decouple pasid from such vm on process destroy to avoid duplicate pasid release. Signed-off-by: Oak Zeng <Oak.Zeng@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-08-29drm/amdgpu: Set pasid for compute vm (v2)Oak Zeng
To make a amdgpu vm to a compute vm, the old pasid will be freed and replaced with a pasid managed by kfd. Kfd can't reuse original pasid allocated by amdgpu because kfd uses different pasid policy with amdgpu. For example, all graphic devices share one same pasid in a process. v2: rebase (Alex) Signed-off-by: Oak Zeng <Oak.Zeng@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-07-11drm/amdkfd: Fix error codes in kfd_get_processWei Lu
Return ERR_PTR(-EINVAL) if kfd_get_process fails to find the process. This fixes kernel oopses when a child process calls KFD ioctls with a file descriptor inherited from the parent process. Signed-off-by: Wei Lu <wei.lu2@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2018-04-10drm/amdkfd: Implement doorbell allocation for SOC15Felix Kuehling
Allocate doorbells according to the doorbell routing information on SOC15 ASICs (Vega10 and later). On older ASICs we continue to use the queue_id as the doorbell ID to maintain compatibility with the Thunk. Signed-off-by: Shaoyun Liu <Shaoyun.Liu@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2018-04-10drm/amdkfd: Clean up KFD_MMAP_ offset handlingHarish Kasiviswanathan
Use bit-rotate for better clarity and remove _MASK from the #defines as these represent mmap types. Centralize all the parsing of the mmap offset in kfd_mmap and add device parameter to doorbell and reserved_mem map functions. Encode gpu_id into upper bits of vm_pgoff. This frees up the lower bits for encoding the the doorbell ID on Vega10. Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2018-03-23drm/amdkfd: Add quiesce_mm and resume_mm to kgd2kfd_callsFelix Kuehling
These interfaces allow KGD to stop and resume all GPU user mode queue access to a process address space. This is needed for handling MMU notifiers of userptrs mapped for GPU access in KFD VMs. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2018-03-23drm/amdkfd: Use ordered workqueue to restore processesFelix Kuehling
Restoring multiple processes concurrently can lead to live-locks where each process prevents the other from validating all its BOs. v2: fix duplicate check of same variable Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2018-03-15drm/amdkfd: Add TC flush on VMID deallocation for HawaiiFelix Kuehling
On GFX7 the CP does not perform a TC flush when queues are unmapped. To avoid TC eviction from accessing an invalid VMID, flush it explicitly before releasing a VMID. v2: Fix unnecessary list_for_each_entry_safe v3: Moved allocation to kfd_process_device_init_vm Signed-off-by: Amber Lin <Amber.Lin@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2018-03-15drm/amdkfd: Allocate CWSR trap handler memory for dGPUsFelix Kuehling
Add helpers for allocating GPUVM memory in kernel mode and use them to allocate memory for the CWSR trap handler. v2: Use dev instead of pdd->dev in kfd_process_free_gpuvm v3: * Cleaned up and simplified kfd_process_alloc_gpuvm * Moved allocation for dGPU to kfd_process_device_init_vm Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2018-03-15drm/amdkfd: Add per-process IDR for buffer handlesFelix Kuehling
Also used for cleaning up on process termination. v2: Refactored cleanup on process termination Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>