linux/linux-stable.git - Linux kernel stable tree

Age	Commit message (Collapse)	Author
2021-05-19	drm/amd/amdgpu: fix a potential deadlock in gpu reset	Lang Yu
	When amdgpu_ib_ring_tests failed, the reset logic called amdgpu_device_ip_suspend twice, then deadlock occurred. Deadlock log: [ 805.655192] amdgpu 0000:04:00.0: amdgpu: ib ring test failed (-110). [ 806.290952] [drm] free PSP TMR buffer [ 806.319406] ============================================ [ 806.320315] WARNING: possible recursive locking detected [ 806.321225] 5.11.0-custom #1 Tainted: G W OEL [ 806.322135] -------------------------------------------- [ 806.323043] cat/2593 is trying to acquire lock: [ 806.323825] ffff888136b1cdc8 (&adev->dm.dc_lock){+.+.}-{3:3}, at: dm_suspend+0xb8/0x1d0 [amdgpu] [ 806.325668] but task is already holding lock: [ 806.326664] ffff888136b1cdc8 (&adev->dm.dc_lock){+.+.}-{3:3}, at: dm_suspend+0xb8/0x1d0 [amdgpu] [ 806.328430] other info that might help us debug this: [ 806.329539] Possible unsafe locking scenario: [ 806.330549] CPU0 [ 806.330983] ---- [ 806.331416] lock(&adev->dm.dc_lock); [ 806.332086] lock(&adev->dm.dc_lock); [ 806.332738] * DEADLOCK * [ 806.333747] May be due to missing lock nesting notation [ 806.334899] 3 locks held by cat/2593: [ 806.335537] #0: ffff888100d3f1b8 (&attr->mutex){+.+.}-{3:3}, at: simple_attr_read+0x4e/0x110 [ 806.337009] #1: ffff888136b1fd78 (&adev->reset_sem){++++}-{3:3}, at: amdgpu_device_lock_adev+0x42/0x94 [amdgpu] [ 806.339018] #2: ffff888136b1cdc8 (&adev->dm.dc_lock){+.+.}-{3:3}, at: dm_suspend+0xb8/0x1d0 [amdgpu] [ 806.340869] stack backtrace: [ 806.341621] CPU: 6 PID: 2593 Comm: cat Tainted: G W OEL 5.11.0-custom #1 [ 806.342921] Hardware name: AMD Celadon-CZN/Celadon-CZN, BIOS WLD0C23N_Weekly_20_12_2 12/23/2020 [ 806.344413] Call Trace: [ 806.344849] dump_stack+0x93/0xbd [ 806.345435] __lock_acquire.cold+0x18a/0x2cf [ 806.346179] lock_acquire+0xca/0x390 [ 806.346807] ? dm_suspend+0xb8/0x1d0 [amdgpu] [ 806.347813] __mutex_lock+0x9b/0x930 [ 806.348454] ? dm_suspend+0xb8/0x1d0 [amdgpu] [ 806.349434] ? amdgpu_device_indirect_rreg+0x58/0x70 [amdgpu] [ 806.350581] ? _raw_spin_unlock_irqrestore+0x47/0x50 [ 806.351437] ? dm_suspend+0xb8/0x1d0 [amdgpu] [ 806.352437] ? rcu_read_lock_sched_held+0x4f/0x80 [ 806.353252] ? rcu_read_lock_sched_held+0x4f/0x80 [ 806.354064] mutex_lock_nested+0x1b/0x20 [ 806.354747] ? mutex_lock_nested+0x1b/0x20 [ 806.355457] dm_suspend+0xb8/0x1d0 [amdgpu] [ 806.356427] ? soc15_common_set_clockgating_state+0x17d/0x19 [amdgpu] [ 806.357736] amdgpu_device_ip_suspend_phase1+0x78/0xd0 [amdgpu] [ 806.360394] amdgpu_device_ip_suspend+0x21/0x70 [amdgpu] [ 806.362926] amdgpu_device_pre_asic_reset+0xb3/0x270 [amdgpu] [ 806.365560] amdgpu_device_gpu_recover.cold+0x679/0x8eb [amdgpu] Signed-off-by: Lang Yu <Lang.Yu@amd.com> Acked-by: Christian KÃnig <christian.koenig@amd.com> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-13	drm/amdgpu: add judgement when add ip blocks (v2)	Likun GAO
	Judgement whether to add an sw ip according to the harvest info. v2: fix indentation (Alex) Signed-off-by: Likun Gao <Likun.Gao@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-29	drm/amdgpu: Register VGA clients after init can no longer fail	Kai-Heng Feng
	When an amdgpu device fails to init, it makes another VGA device cause kernel splat: kernel: amdgpu 0000:08:00.0: amdgpu: amdgpu_device_ip_init failed kernel: amdgpu 0000:08:00.0: amdgpu: Fatal error during GPU init kernel: amdgpu: probe of 0000:08:00.0 failed with error -110 ... kernel: amdgpu 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none kernel: BUG: kernel NULL pointer dereference, address: 0000000000000018 kernel: #PF: supervisor read access in kernel mode kernel: #PF: error_code(0x0000) - not-present page kernel: PGD 0 P4D 0 kernel: Oops: 0000 [#1] SMP NOPTI kernel: CPU: 6 PID: 1080 Comm: Xorg Tainted: G W 5.12.0-rc8+ #12 kernel: Hardware name: HP HP EliteDesk 805 G6/872B, BIOS S09 Ver. 02.02.00 12/30/2020 kernel: RIP: 0010:amdgpu_device_vga_set_decode+0x13/0x30 [amdgpu] kernel: Code: 06 31 c0 c3 b8 ea ff ff ff 5d c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 55 48 8b 87 90 06 00 00 48 89 e5 53 89 f3 <48> 8b 40 18 40 0f b6 f6 e8 40 58 39 fd 80 fb 01 5b 5d 19 c0 83 e0 kernel: RSP: 0018:ffffae3c0246bd68 EFLAGS: 00010002 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 kernel: RDX: ffff8dd1af5a8560 RSI: 0000000000000000 RDI: ffff8dce8c160000 kernel: RBP: ffffae3c0246bd70 R08: ffff8dd1af5985c0 R09: ffffae3c0246ba38 kernel: R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000246 kernel: R13: 0000000000000000 R14: 0000000000000003 R15: ffff8dce81490000 kernel: FS: 00007f9303d8fa40(0000) GS:ffff8dd1af580000(0000) knlGS:0000000000000000 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kernel: CR2: 0000000000000018 CR3: 0000000103cfa000 CR4: 0000000000350ee0 kernel: Call Trace: kernel: vga_arbiter_notify_clients.part.0+0x4a/0x80 kernel: vga_get+0x17f/0x1c0 kernel: vga_arb_write+0x121/0x6a0 kernel: ? apparmor_file_permission+0x1c/0x20 kernel: ? security_file_permission+0x30/0x180 kernel: vfs_write+0xca/0x280 kernel: ksys_write+0x67/0xe0 kernel: __x64_sys_write+0x1a/0x20 kernel: do_syscall_64+0x38/0x90 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae kernel: RIP: 0033:0x7f93041e02f7 kernel: Code: 75 05 48 83 c4 58 c3 e8 f7 33 ff ff 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 kernel: RSP: 002b:00007fff60e49b28 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 kernel: RAX: ffffffffffffffda RBX: 000000000000000b RCX: 00007f93041e02f7 kernel: RDX: 000000000000000b RSI: 00007fff60e49b40 RDI: 000000000000000f kernel: RBP: 00007fff60e49b40 R08: 00000000ffffffff R09: 00007fff60e499d0 kernel: R10: 00007f93049350b5 R11: 0000000000000246 R12: 000056111d45e808 kernel: R13: 0000000000000000 R14: 000056111d45e7f8 R15: 000056111d46c980 kernel: Modules linked in: nls_iso8859_1 snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_seq input_leds snd_seq_device snd_timer snd soundcore joydev kvm_amd serio_raw k10temp mac_hid hp_wmi ccp kvm sparse_keymap wmi_bmof ucsi_acpi efi_pstore typec_ucsi rapl typec video wmi sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c xor raid6_pq raid1 raid0 multipath linear dm_mirror dm_region_hash dm_log hid_generic usbhid hid amdgpu drm_ttm_helper ttm iommu_v2 gpu_sched i2c_algo_bit drm_kms_helper syscopyarea sysfillrect crct10dif_pclmul sysimgblt crc32_pclmul fb_sys_fops ghash_clmulni_intel cec rc_core aesni_intel crypto_simd psmouse cryptd r8169 i2c_piix4 drm ahci xhci_pci realtek libahci xhci_pci_renesas gpio_amdpt gpio_generic kernel: CR2: 0000000000000018 kernel: ---[ end trace 76d04313d4214c51 ]--- Commit 4192f7b57689 ("drm/amdgpu: unmap register bar on device init failure") makes amdgpu_driver_unload_kms() skips amdgpu_device_fini(), so the VGA clients remain registered. So when vga_arbiter_notify_clients() iterates over registered clients, it causes NULL pointer dereference. Since there's no reason to register VGA clients that early, so solve the issue by putting them after all the goto cleanups. v2: - Remove redundant vga_switcheroo cleanup in failed: label. Fixes: 4192f7b57689 ("drm/amdgpu: unmap register bar on device init failure") Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: split mmhub callbacks into ras and non-ras ones	Hawking Zhang
	mmhub ras is only avaiable in cerntain mmhub ip generation. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: indirect register access for nv12 sriov	Peng Ju Zhou
	1. expand rlcg interface for gc & mmhub indirect access 2. add rlcg interface for no kiq v2: squash in fix for gfx9 (Changfeng) Signed-off-by: Peng Ju Zhou <PengJu.Zhou@amd.com> Reviewed-by: Emily.Deng <Emily.Deng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: indirect register access for nv12 sriov	Peng Ju Zhou
	get pf2vf msg info at it's earliest time so that guest driver can use these info to decide whether register indirect access enabled. Signed-off-by: Peng Ju Zhou <PengJu.Zhou@amd.com> Reviewed-by: Emily.Deng <Emily.Deng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: Reset error code for 'no handler' case	Lijo Lazar
	If reset handler is not implemented, reset error before proceeding. Fixes issue with the following trace - [ 106.508592] amdgpu 0000:b1:00.0: amdgpu: ASIC reset failed with error, -38 for drm dev, 0000:b1:00.0 [ 106.508972] amdgpu 0000:b1:00.0: amdgpu: GPU reset succeeded, trying to resume [ 106.509116] [drm] PCIE GART of 512M enabled. [ 106.509120] [drm] PTB located at 0x0000008000000000 [ 106.509136] [drm] VRAM is lost due to GPU reset! [ 106.509332] [drm] PSP is resuming... Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-and-tested-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: Enable recovery on aldebaran	Lijo Lazar
	Add aldebaran to devices which support recovery Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: Make set PG/CG state functions public	Lijo Lazar
	Expose PG/CG set states functions for other clients Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: Add reset control handling to reset workflow	Lijo Lazar
	This prefers reset control based handling if it's implemented for a particular ASIC. If not, it takes the legacy path. It uses the legacy method of preparing environment (job, scheduler tasks) and restoring environment. v2: remove unused variable (Alex) Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amd/amdgpu implement tdr advanced mode	Jack Zhang
	[Why] Previous tdr design treats the first job in job_timeout as the bad job. But sometimes a later bad compute job can block a good gfx job and cause an unexpected gfx job timeout because gfx and compute ring share internal GC HW mutually. [How] This patch implements an advanced tdr mode.It involves an additinal synchronous pre-resubmit step(Step0 Resubmit) before normal resubmit step in order to find the real bad job. 1. At Step0 Resubmit stage, it synchronously submits and pends for the first job being signaled. If it gets timeout, we identify it as guilty and do hw reset. After that, we would do the normal resubmit step to resubmit left jobs. 2. For whole gpu reset(vram lost), do resubmit as the old way. v2: squash in build fix (Alex) Signed-off-by: Jack Zhang <Jack.Zhang1@amd.com> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: Convert sysfs sprintf/snprintf family to sysfs_emit	Tian Tao
	Fix the following coccicheck warning: drivers/gpu//drm/amd/amdgpu/amdgpu_ras.c:434:9-17: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_xgmi.c:220:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_xgmi.c:249:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/df_v3_6.c:208:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_psp.c:2973:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_vram_mgr.c:75:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_vram_mgr.c:112:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_vram_mgr.c:58:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_vram_mgr.c:93:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_vram_mgr.c:125:9-17: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_gtt_mgr.c:52:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_gtt_mgr.c:71:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_device.c:140:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_device.c:164:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_device.c:186:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_device.c:208:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_atombios.c:1916:8-16: WARNING: use scnprintf or sprintf Signed-off-by: Tian Tao <tiantao6@hisilicon.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: move vram recover into sriov full access	Horace Chen
	[what] currently driver recover vram after full access, which may hit a corner case that meanwhile another whole gpu reset may be triggered by another VF, which will cause vram recover fail then fail the whole device reset. [how] move the recover vram into full access. So another bad VF will not disturb the recover sequence for this vf. Signed-off-by: Horace Chen <horace.chen@amd.com> Reviewed by: Monk.Liu <monk.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: skip kfd suspend/resume for S0ix	Alex Deucher
	GFX is in gfxoff mode during s0ix so we shouldn't need to actually tear anything down and restore it. Acked-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: drop S0ix checks around CG/PG in suspend	Alex Deucher
	We handle it properly within the CG/PG functions directly now. Acked-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: skip CG/PG for gfx during S0ix	Pratik Vishwakarma
	Not needed as the device is in gfxoff state so the CG/PG state is handled just like it would be for gfxoff during runtime gfxoff. This should also prevent delays on resume. Reworked from Pratik's original patch (Alex) Acked-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Pratik Vishwakarma <Pratik.Vishwakarma@amd.com>
2021-04-09	drm/amdgpu: update comments about s0ix suspend/resume	Alex Deucher
	Provide and explanation as to why we skip GFX and PSP for S0ix. GFX goes into gfxoff, same as runtime, so no need to tear down and re-init. PSP is part of the always on state, so no need to touch it. Acked-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu/swsmu: skip gfx cgpg on s0ix suspend	Alex Deucher
	The SMU expects CGPG to be enabled when entering S0ix. with this we can re-enable SMU suspend. Acked-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: re-enable suspend phase 2 for S0ix	Alex Deucher
	This really needs to be done to properly tear down the device. SMC, PSP, and GFX are still problematic, need to dig deeper into what aspect of them that is problematic. Acked-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: move s0ix check into amdgpu_device_ip_suspend_phase2 (v3)	Alex Deucher
	No functional change. v2: use correct dev v3: rework Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: clean up non-DC suspend/resume handling	Alex Deucher
	Move the non-DC specific code into the DCE IP blocks similar to how we handle DC. This cleans up the common suspend and resume pathes. Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: rework S3/S4/S0ix state handling	Alex Deucher
	Set flags at the top level pmops callbacks to track state. This cleans up the current set of flags and properly handles S4 on S0ix capable systems. Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: fix the hibernation suspend with s0ix	Prike Liang
	During system hibernation suspend still need un-gate gfx CG/PG firstly to handle HW status check before HW resource destory. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Acked-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: disentangle HG systems from vgaswitcheroo	Alex Deucher
	There's no need to keep vgaswitcheroo around for HG systems. They don't use muxes and their power control is handled via ACPI. Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	Revert "drm/amdgpu: disable gpu reset on Vangogh for now"	Xiaojian Du
	This reverts commit 33cf440d594bfbf81fc20604957bc64f02d0b560. And it will enable mode-2 gpu reset for vangogh, it asks PSP firmware version is 00.1A.00.0F or newer. Signed-off-by: Xiaojian Du <Xiaojian.Du@amd.com> Reviewed-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: add codes to capture invalid hardware access when recovery	Dennis Li
	When recovery thread has begun GPU reset, there should be not other threads to access hardware, otherwise system randomly hang. v2 (chk): rewritten from scratch, use trylock and lockdep instead of hand wiring the logic. v3: add in_irq check v4: change to check in_task Signed-off-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-23	drm/amdgpu: drop extraneous hw_status update	Alex Deucher
	We set the same variable a few lines above. Drop the duplicate setting. Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-23	drm/amdgpu: Enable light SBR in XGMI+passthrough configuration	shaoyunl
	This is to fix the case where it only enable the light SMU on normal device init. This feature actually need to be enabled after ASIC been reset as well. Signed-off-by: shaoyunl <shaoyun.liu@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-23	drm/amdgpu: drop legacy IO bar support	Alex Deucher
	It was leftover from radeon where it was required for some specific old hardware. It hasn't been required for ages and the driver already falls back to MMIO when legacy IO is not available. Legacy IO also seems to be problematic on on some thunderbolt devices. Drop it. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: Nicholas Johnson <nicholas.johnson-opensource@outlook.com.au>
2021-03-23	drm/amdgpu: nuke the ih reentrant lock	Christian König
	Interrupts on are non-reentrant on linux. This is just an ancient leftover from radeon where irq processing was kicked of from different places. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-23	drm/amdgpu: fix send ras disable cmd when asic not support ras	Stanley.Yang
	cause: It is necessary to send ras disable command to ras-ta during gfx block ras later init, because the ras capability is disable read from vbios for vega20 gaming, but the ras context is released during ras init process, this will cause send ras disable command to ras-to failed. how: Delay releasing ras context, the ras context will be released after gfx block later init done. Changed from V1: move release_ras_context into ras_resume Changed from V2: check BIT(UMC) is more reasonable before access eeprom table Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-23	drm/amdgpu: Fix spelling mistake "disabed" -> "disabled"	Colin Ian King
	There is a spelling mistake in a drm debug message. Fix it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-23	drm/amdgpu: Enable light SBR for SMU on passthrough and XGMI configuration	shaoyunl
	SMU introduce the new interface to enable light Secondary Bus Reset mode, driver enable it on passthrough + XGMI configuration Signed-off-by: shaoyunl <shaoyun.liu@amd.com> Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-23	drm/amdgpu: Reset the devices in the XGMI hive duirng probe	shaoyunl
	In passthrough configuration, hypervisior will trigger the SBR(Secondary bus reset) to the devices without sync to each other. This could cause device hang since for XGMI configuration, all the devices within the hive need to be reset at a limit time slot. This serial of patches try to solve this issue by co-operate with new SMU which will only do minimum house keeping to response the SBR request but don't do the real reset job and leave it to driver. Driver need to do the whole sw init and minimum HW init to bring up the SMU and trigger the reset(possibly BACO) on all the ASICs at the same time Signed-off-by: shaoyunl <shaoyun.liu@amd.com> Acked-by: Andrey Grodzovsky andrey.grodzovsky@amd.com Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-23	drm/amdgpu: Add reset_list for device list used for reset	shaoyunl
	The gmc.xgmi.head list originally is designed for device list in the XGMI hive. Mix use it for reset purpose will prevent the reset function to adjust XGMI device list which is required in next change Signed-off-by: shaoyunl <shaoyun.liu@amd.com> Reviewed-by: Andrey Grodzovsky andrey.grodzovsky@amd.com Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-23	drm/amdgpu: Add kfd init_complete flag to check from amdgpu side	shaoyunl
	amdgpu driver may be in reset state during init which will not initialize the kfd, driver need to initialize the KFD after reset by check the flag Signed-off-by: shaoyunl <shaoyun.liu@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-23	drm/amdgpu: retire aldebaran gpu_info firmware	Hawking Zhang
	driver should use the gfx_info atomfirmware interface Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-23	drm/amdgpu:add smu mode1/2 support for aldebaran	Feifei Xu
	Use MSG_GfxDriverReset for mode reset and retire MSG_Mode1Reset. Centralize soc15_asic_mode1_reset() and nv_asic_mode1_reset()functions. Add mode2_reset_is_support() for smu->ppt_funcs. Signed-off-by: Feifei Xu <Feifei.Xu@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-10	drm/amdgpu: set ip blocks for aldebaran	Le Ma
	Set ip blocks and asic family id Signed-off-by: Le Ma <le.ma@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Acked-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-10	drm/amdgpu: add gpu_info fw parse support for aldebaran	Le Ma
	Parses asic configurations stored in gpu_info firmware and make them available for driver to use. Signed-off-by: Le Ma <le.ma@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Acked-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-10	drm/amdgpu: add aldebaran asic type	Le Ma
	Add aldebaran in amdgpu_asic_name array and amdgpu_asic_type enum Signed-off-by: Le Ma <le.ma@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Acked-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-05	drm/amdgpu: Fix some unload driver issues	Emily Deng
	When unloading driver after killing some applications, it will hit sdma flush tlb job timeout which is called by ttm_bo_delay_delete. So to avoid the job submit after fence driver fini, call ttm_bo_lock_delayed_workqueue before fence driver fini. And also put drm_sched_fini before waiting fence. Signed-off-by: Emily Deng <Emily.Deng@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-05	drm/amd/amdgpu: add fini virt data exchange to ip_suspend	Jingwen Chen
	[Why] when try to shutdown guest vm in sriov mode, virt data exchange is not fini. After vram lost, trying to write vram could hang cpu. [How] add fini virt data exchange in ip_suspend Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com> Reviewed-by: Jack Zhang <Jack.Zhang1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-05	drm/amdgpu: enable one vf mode on sienna cichlid vf	Horace Chen
	sienna cichlid needs one vf mode which allows vf to set and get clock status from guest vm. So now expose the required interface and allow some smu request on VF mode. Also since this asic blocked direct MMIO access, use KIQ to send SMU request under sriov vf. OD use same command as getting pp table which is not allowed for sienna cichlid, so remove OD feature under sriov vf. Signed-off-by: Horace Chen <horace.chen@amd.com> Reviewed-by: Monk Liu<monk.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-02-26	drm/amdgpu: remove unnecessary reading for epprom header	Dennis Li
	If the number of badpage records exceed the threshold, driver has updated both epprom header and control->tbl_hdr.header before gpu reset, therefore GPU recovery thread no need to read epprom header directly. v2: merge amdgpu_ras_check_err_threshold into amdgpu_ras_eeprom_check_err_threshold Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-02-26	drm/amd/amdgpu: move inc gpu_reset_counter after drm_sched_stop	Jingwen Chen
	Move gpu_reset_counter after drm_sched_stop to avoid race condition caused by job submitted between reset_count +1 and drm_sched_stop. Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-02-24	drm/amdgpu: fix shutdown and poweroff process failed with s0ix	Prike Liang
	In the shutdown and poweroff opt on the s0i3 system we still need un-gate the gfx clock gating and power gating before destory amdgpu device. Fixes: 628c36d7b238e2 ("drm/amdgpu: update amdgpu device suspend/resume sequence for s0i3 support") Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1499 Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
2021-02-09	drm/amdgpu: enable gpu recovery for dimgrey_cavefish	Tao Zhou
	As dimgrey_cavefish driver is stable enough, set gpu recovery as default in HW hang for dimgrey_cavefish. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Jiansong Chen <Jiansong.Chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-02-09	drm/amdgpu: use runpm flag rather than fbcon for kfd runtime suspend (v2)	Alex Deucher
	the flag used by kfd is not actually related to fbcon, it just happens to align. Use the runpm flag instead so that we can decouple it from the fbcon flag. v2: fix resume as well Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Acked-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-02-09	drm/amdgpu: drop extra drm_kms_helper_poll_enable/disable calls	Alex Deucher
	These are already called in amdgpu_device_suspend/resume which are already called in the same functions. Acked-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>