Age | Commit message (Collapse) | Author |
|
The driver already uses the dev_err_probe() helper (that only prints error
messages for the -EPROBE_DEFER case) when fails to get any other resource.
Also do the same when it fails to obtain the reset GPIO.
Reviewed-by: Marcus Folkesson <marcus.folkesson@gmail.com>
Link: https://lore.kernel.org/r/20250715110411.448343-3-javierm@redhat.com
Signed-off-by: Javier Martinez Canillas <javierm@redhat.com>
|
|
It seems the driver took some inspiration from ssd130x and some of the
functions (encoder callbacks) were not renamed to use a st7571_ prefix.
Reviewed-by: Marcus Folkesson <marcus.folkesson@gmail.com>
Link: https://lore.kernel.org/r/20250715110411.448343-2-javierm@redhat.com
Signed-off-by: Javier Martinez Canillas <javierm@redhat.com>
|
|
The test is quite fragile since it tries to allocate halve available system
memory + 1 page.
If the system has either not enough memory to make the allocation work
with other things running in parallel or to much memory so the allocation
fails as to large/invalid the test will fail.
Completely remove the test. We already validate swapout on the device
level and that test seems to be stable.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250710144129.1803-2-christian.koenig@amd.com
|
|
The test works even without it, but lockdep starts screaming when it is
activated.
Trivially fix it by acquiring the lock before we try to allocate
something.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250710144129.1803-1-christian.koenig@amd.com
|
|
The function radeon_resume_kms() acquires the console lock. It is
inconsistent, as it depends on the notify_client argument. That
lock then covers a number of suspend operations that are unrelated
to the console.
Remove the calls to console_lock() and console_unlock() from the
radeon function. The console lock is only required by DRM's fbdev
emulation, which acquires it as necessary.
Also fixes a possible circular dependency between the console lock
and the client-list mutex, where the mutex is supposed to be taken
first.
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
The radeon driver holds the console lock while suspending in-kernel
DRM clients. This creates a circular dependency with the client-list
mutex, which is supposed to be acquired first. Reported when combining
radeon with another DRM driver.
Therefore, do not take the console lock in radeon, but let the fbdev
DRM client acquire the lock when needed. This is what all other DRM
drivers so.
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Reported-by: Jeff Johnson <jeff.johnson@oss.qualcomm.com>
Closes: https://lore.kernel.org/dri-devel/0a087cfd-bd4c-48f1-aa2f-4a3b12593935@oss.qualcomm.com/
Suggested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
when loading bad page in the same nps mode, need to set the other fields
fields in eeprom records manually besides retired_page
Signed-off-by: ganglxie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
add eeprom data checksum check before driver unload. reset eeprom
and save correct data to eeprom when check failed
Signed-off-by: ganglxie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
In DCN401 pre-blending degamma LUT isn't affecting cursor as in previous
DCN version. As this is not the behavior close to what is expected for
CRTC degamma LUT, disable CRTC degamma LUT property in this HW.
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/4176
---
When enabling HDR on KDE, it takes the first CRTC 1D LUT available and
apply a color transformation (Gamma 2.2 -> PQ). AMD driver usually
advertises a CRTC degamma LUT as the first CRTC 1D LUT, but it's
actually applied pre-blending. In previous HW version, it seems to work
fine because the 1D LUT was applied to cursor too, but DCN401 presents a
different behavior and the 1D LUT isn't affecting the hardware cursor.
To address the wrong gamma on cursor with HDR (see the link), I came up
with this patch that disables CRTC degamma LUT in this hw, since it
presents a different behavior than others. With this KDE sees CRTC
regamma LUT as the first post-blending 1D LUT available. This is
actually more consistent with AMD color pipeline. It was tested by the
reporter, since I don't have the HW available for local testing and
debugging.
Melissa
---
Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Signed-off-by: Melissa Wen <mwen@igalia.com>
Signed-off-by: Ivan Lipski <ivan.lipski@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
This reverts commit 99e25e4683d7cfdf79dcc328e11bb6c924c77566.
[Why & How]
This commit caused a blank screen on internal display when projecting to
an external display on DCN314.
Reviewed-by: Aric Cyr <aric.cyr@amd.com>
Signed-off-by: Ivan Lipski <ivan.lipski@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
SmartMux is a mechanism to switch the GPU being used for scanout in a
hybrid configuration. This is used for devices with an eDP and two GPUs.
This is only valid when the system has a physical switch (Multiplexer)
in the board to switch between the two GPUs.
When a graphically intensive workload like a game is being run, the
system can be switch the active display to the dGPU, so that we can
avoid copying the buffer from dGPU to APU for scanout. This helps with
latency and FPS. When power consumption is preferred, the system can be
switched to the APU.
Reviewed-by: Alvin Lee <alvin.lee2@amd.com>
Signed-off-by: Aurabindo Pillai <aurabindo.pillai@amd.com>
Signed-off-by: Ivan Lipski <ivan.lipski@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
[WHY]
Free memory to avoid memory leak
Reviewed-by: Joshua Aberback <joshua.aberback@amd.com>
Signed-off-by: Clayton King <clayton.king@amd.com>
Signed-off-by: Ivan Lipski <ivan.lipski@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
[why & how]
UHBR link rate capable eDPs will use HPO for encoding. Need to pass
HPO stream and link encoder instances to DMCUB for Replay FSM to
know which instances to use.
Reviewed-by: Charlene Liu <charlene.liu@amd.com>
Signed-off-by: Ovidiu Bunea <Ovidiu.Bunea@amd.com>
Signed-off-by: Ivan Lipski <ivan.lipski@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
[why & how]
DP1 eDP is still considered a single-eDP case and should support Panel Replay.
Modify secondary eDP policy to reflect this and update Replay state accordingly.
Reviewed-by: Charlene Liu <charlene.liu@amd.com>
Signed-off-by: Ovidiu Bunea <Ovidiu.Bunea@amd.com>
Signed-off-by: Ivan Lipski <ivan.lipski@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
[WHY]
For SQ128 pattern some vendor-specific overrides are required.
Previously a hardcoded clock gen source value was incorrectly programmed,
causing our override to retimer's clock source override to be ignored.
Due to some PHY issues on certain APU programs, we see failures on retimer
bypass ports extend to electrical testing downstream of PHY due to some host
clock jitter which the retimer follows.
[HOW]
Fix typo to use correct clock gen source override of 0xC4 rather than 0x4C.
Reviewed-by: Charlene Liu <charlene.liu@amd.com>
Signed-off-by: Michael Strauss <michael.strauss@amd.com>
Signed-off-by: Ivan Lipski <ivan.lipski@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
[Why & How]
To facilitate debugging, the following behaviors are defined for existing
debug option disable_ips_in_vpb
0 - Enable IPS in LVP - let driver decide (legacy)
1 - Disable IPS in LVP
2 - Enable IPS1 and RCG in LVP
3 - Enable IPS1 Z8, IPS1 and RCG in LVP
Reviewed-by: Duncan Ma <duncan.ma@amd.com>
Signed-off-by: Leo Chen <leo.chen@amd.com>
Signed-off-by: Ivan Lipski <ivan.lipski@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
[Why & How]
Add static pg implementations and debug flags for future use.
Reviewed-by: Duncan Ma <duncan.ma@amd.com>
Signed-off-by: Leo Chen <leo.chen@amd.com>
Signed-off-by: Ivan Lipski <ivan.lipski@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
[Why & How]
DMUB shall be notified on driver hardware
release. Implement notification.
Reviewed-by: Duncan Ma <duncan.ma@amd.com>
Signed-off-by: Duncan Ma <Duncan.Ma@amd.com>
Signed-off-by: Ivan Lipski <ivan.lipski@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
[Why & How]
Aligned IPS FW state with DMCUB IPS FW state
Added debug option disable_ips_rcg to modify RCG behaviour in IPS modes.
Updated existing debug option disable_ips to align with new changes introduced by IPSv2.0
Reviewed-by: Duncan Ma <duncan.ma@amd.com>
Signed-off-by: Leo Chen <leo.chen@amd.com>
Signed-off-by: Ivan Lipski <ivan.lipski@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
[Why & How]
Display idle notification shall
be sent by driver on D3 entry. Implement
notification to DMUB and PMFW.
Reviewed-by: Duncan Ma <duncan.ma@amd.com>
Signed-off-by: Duncan Ma <Duncan.Ma@amd.com>
Signed-off-by: Ivan Lipski <ivan.lipski@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
[why]
dc has some code out of sync:
dc_commit_updates_for_stream handles v1/v2/v3,
but dc_update_planes_and_stream makes v1 asic to use v2.
as a reression fix: limit clear_update_flags to dcn32 or newer asic.
need to follow up that v1 asic using v2 issue.
Reviewed-by: Syed Hassan <syed.hassan@amd.com>
Signed-off-by: Charlene Liu <Charlene.Liu@amd.com>
Signed-off-by: Ivan Lipski <ivan.lipski@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
[Why & How]
Some monitor have audio output but SAB data is zero. Skip
check this in this case.
Reviewed-by: Charlene Liu <charlene.liu@amd.com>
Reviewed-by: Jun Lei <jun.lei@amd.com>
Signed-off-by: Fudongwang <Fudong.Wang@amd.com>
Signed-off-by: Ivan Lipski <ivan.lipski@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
[Why/How]
Add the timing source needed to support DID Type5.
Reviewed-by: Aric Cyr <aric.cyr@amd.com>
Signed-off-by: Ilya Bakoulin <Ilya.Bakoulin@amd.com>
Signed-off-by: Ivan Lipski <ivan.lipski@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
[WHY]
DSC block level should only be responsible for reporting single DSC
instance capabilities. Factoring in ODM combine requirements should be
handled in dc_dsc.c. Both components should acquire clocks from clk_mgr
to determine throughput capabilities instead of relying on hard coded
values as these can differ by SoC and SKU.
[HOW]
1) Add dsc_get_single_enc_caps to acquire single DSC instance
capabilities (replacing dsc_get_enc_caps), factoring in DSCCLK
2) add build_dsc_enc_caps to combine single DSC instance capabilities
3) account for max pixel rate per pipe (DISPCLK) when calculating
minimum slice count
Reviewed-by: Wenjing Liu <wenjing.liu@amd.com>
Signed-off-by: Dillon Varone <dillon.varone@amd.com>
Signed-off-by: Ivan Lipski <ivan.lipski@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
compilation units
[Why & How]
Expose dcn401_initialize_min_clocks() for future use and add additional
check for IP register.
Reviewed-by: Nevenko Stupar <nevenko.stupar@amd.com>
Signed-off-by: Karthi Kandasamy <karthi.kandasamy@amd.com>
Signed-off-by: Ivan Lipski <ivan.lipski@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
[WHY & HOW]
Adding basic logic to allocate unused RMCM block and TMZ support.
Reviewed-by: Krunoslav Kovac <krunoslav.kovac@amd.com>
Signed-off-by: Yihan Zhu <Yihan.Zhu@amd.com>
Signed-off-by: Ivan Lipski <ivan.lipski@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
[Why]
When booting without an HDMI display connected, the I2C registers
are not initialized correctly, leading to DC_I2C_ARBITRATION register
getting stuck with DC_I2C_REG_RW_CNTL_STATUS == USED_BY_SW.
[How]
* Correct TOCTOU race condition in engine acquire logic which did not
check against DMUB trying to acquire it at the same time.
* Deassert SOFT_RESET before acquire, as it can block access to other
I2C registers.
* Add a workaround in release, checking that after triggerring
DC_I2C_SW_DONE_USING_I2C_REG, DC_I2C_REG_RW_CNTL_STATUS != USED_BY_SW.
If necessary, trigger DC_I2C_SW_DONE_USING_I2C_REG again.
* Remove unnecessary clear of DC_I2C_SW_USE_I2C_REG_REQ, which engine
ignores according to specification.
Reviewed-by: Alvin Lee <alvin.lee2@amd.com>
Signed-off-by: Dominik Kaszewski <dominik.kaszewski@amd.com>
Signed-off-by: Ivan Lipski <ivan.lipski@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
When the driver is unloaded, the interrupt source of
the rma device is not released, resulting in the failure
of hw_init when loading again using bad_page_threshold.
Signed-off-by: Ce Sun <cesun102@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
KFD has been confirmed that can run on LoongArch systems.
It's necessary to support CONFIG_HSA_AMD on LoongArch.
Signed-off-by: Han Gao <rabenda.cn@gmail.com>
Signed-off-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Start and stop can fail, so add checks.
Fixes: b54695dae995 ("drm/amd: Add per-ring reset for vcn v5.0.0 use")
Reviewed-by: Mario Limonciello <mari.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Mario Limonciello <mario.limonciello@amd.com>
|
|
Start and stop can fail, so add checks.
Fixes: d1a46cdd0053 ("drm/amd: Add per-ring reset for vcn v4.0.5 use")
Reviewed-by: Mario Limonciello <mari.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Mario Limonciello <mario.limonciello@amd.com>
|
|
Start and stop can fail, so add checks.
Fixes: b8b6e6f1654d ("drm/amd: Add per-ring reset for vcn v4.0.0 use")
Reviewed-by: Mario Limonciello <mari.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Mario Limonciello <mario.limonciello@amd.com>
|
|
The ring test needs to be inside the lock.
Fixes: 097af47d3cfb ("drm/amdgpu/gfx10: wait for reset done before remap")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Jiadong Zhu <Jiadong.Zhu@amd.com>
|
|
The ring test needs to be inside the lock.
Fixes: 4c953e53cc34 ("drm/amdgpu/gfx_9.4.3: wait for reset done before remap")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Jiadong Zhu <Jiadong.Zhu@amd.com>
|
|
The ring test needs to be inside the lock.
Fixes: fdbd69486b46 ("drm/amdgpu/gfx9: wait for reset done before remap")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Jiadong Zhu <Jiadong.Zhu@amd.com>
|
|
For current partition mode queries, return the mode cached in partition
manager whenever it's valid.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
TRAINING_PATTERN_SET
Commit a3ef3c2da675 ("drm/dp: Change AUX DPCD probe address from
DPCD_REV to LANE0_1_STATUS") stopped using the DPCD_REV register for
DPCD probing, since this results in link training failures at least when
using an Intel Barlow Ridge TBT hub at UHBR link rates (the
DP_INTRA_HOP_AUX_REPLY_INDICATION never getting cleared after the failed
link training). Since accessing DPCD_REV during link training is
prohibited by the DP Standard, LANE0_1_STATUS (0x202) was used instead,
as it falls within the Standard's valid register address range
(0x102-0x106, 0x202-0x207, 0x200c-0x200f, 0x2216) and it fixed the link
training on the above TBT hub.
However, reading the LANE0_1_STATUS register also has a side-effect at
least on a Novatek eDP panel, as reported on the Closes: link below,
resulting in screen flickering on that panel. One clear side-effect when
doing the 1-byte probe reads from LANE0_1_STATUS during link training
before reading out the full 6 byte link status starting at the same
address is that the panel will report the link training as completed
with voltage swing 0. This is different from the normal, flicker-free
scenario when no DPCD probing is done, the panel reporting the link
training complete with voltage swing 2.
Using the TRAINING_PATTERN_SET register for DPCD probing doesn't have
the above side-effect, the panel will link train with voltage swing 2 as
expected and it will stay flicker-free. This register is also in the
above valid register range and is unlikely to have a side-effect as that
of LANE0_1_STATUS: Reading LANE0_1_STATUS is part of the link training
CR/EQ sequences and so it may cause a state change in the sink - even if
inadvertently as I suspect in the case of the above Novatek panel. As
opposed to this, reading TRAINING_PATTERN_SET is not part of the link
training sequence (it must be only written once at the beginning of the
CR/EQ sequences), so it's unlikely to cause any state change in the
sink.
As a side-note, this Novatek panel also lacks support for TPS3, while
claiming support for HBR2, which violates the DP Standard (the Standard
mandating TPS3 for HBR2).
Besides the Novatek panel (PSR 1), which this change fixes, I also
verified the change on a Samsung (PSR 1) and an Analogix (PSR 2) eDP
panel as well as on the Intel Barlow Ridge TBT hub.
Note that in the drm-tip tree (targeting the v6.17 kernel version) the
i915 and xe drivers keep DPCD probing enabled only for the panel known
to require this (HP ZR24w), hence those drivers in drm-tip are not
affected by the above problem.
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Fixes: a3ef3c2da675 ("drm/dp: Change AUX DPCD probe address from DPCD_REV to LANE0_1_STATUS")
Reported-and-tested-by: Paul Menzel <pmenzel@molgen.mpg.de>
Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14558
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Acked-by: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://lore.kernel.org/r/20250708212331.112898-1-imre.deak@intel.com
(cherry picked from commit bba9aa41654036534d86b198f5647a9ce15ebd7f)
[Imre: Rebased on drm-intel-fixes]
Signed-off-by: Imre Deak <imre.deak@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
[Rodrigo: Changed original commit hash to match with the one propagated in fixes]
|
|
bridge->detect parameter changes
Fix the compile error due to bridge->detect parameter changes.
Reported-by: Dixit Ashutosh <ashutosh.dixit@intel.com>
Closes: https://lore.kernel.org/dri-devel/175250667117.3567548.8371527247937906463.b4-ty@oss.qualcomm.com/T/#m8ecd00a05a330bc9c76f11c981daafcb30a7c2e0
Fixes: 5d156a9c3d5e ("drm/bridge: Pass down connector to drm bridge detect hook")
Signed-off-by: Andy Yan <andyshrk@163.com>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20250715054754.800765-1-andyshrk@163.com
Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
|
|
Panfrost can skip the reset if TDR has fired before the free-job worker.
Currently, since Panfrost doesn't take any action on these scenarios, the
job is being leaked, considering that `free_job()` won't be called.
To avoid such leaks, inform the scheduler that the job did not actually
timeout and no reset was performed through the new status code
DRM_GPU_SCHED_STAT_NO_HANG.
Reviewed-by: Steven Price <steven.price@arm.com>
Link: https://lore.kernel.org/r/20250714-sched-skip-reset-v6-8-5c5ba4f55039@igalia.com
Signed-off-by: Maíra Canal <mcanal@igalia.com>
|
|
Xe can skip the reset if TDR has fired before the free job worker and can
also re-arm the timeout timer in some scenarios. Instead of manipulating
scheduler's internals, inform the scheduler that the job did not actually
timeout and no reset was performed through the new status code
DRM_GPU_SCHED_STAT_NO_HANG.
Note that, in the first case, there is no need to restart submission if it
hasn't been stopped.
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250714-sched-skip-reset-v6-7-5c5ba4f55039@igalia.com
Signed-off-by: Maíra Canal <mcanal@igalia.com>
|
|
Etnaviv can skip a hardware reset in two situations:
1. TDR has fired before the free-job worker and the timeout is spurious.
2. The GPU is still making progress on the front-end and we can give
the job a chance to complete.
Instead of manipulating scheduler's internals, inform the scheduler that
the job did not actually timeout and no reset was performed through
the new status code DRM_GPU_SCHED_STAT_NO_HANG.
Reviewed-by: Lucas Stach <l.stach@pengutronix.de>
Link: https://lore.kernel.org/r/20250714-sched-skip-reset-v6-6-5c5ba4f55039@igalia.com
Signed-off-by: Maíra Canal <mcanal@igalia.com>
|
|
When a CL/CSD job times out, we check if the GPU has made any progress
since the last timeout. If so, instead of resetting the hardware, we skip
the reset and allow the timer to be rearmed. This gives long-running jobs
a chance to complete.
Instead of manipulating scheduler's internals, inform the scheduler that
the job did not actually timeout and no reset was performed through
the new status code DRM_GPU_SCHED_STAT_NO_HANG.
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Link: https://lore.kernel.org/r/20250714-sched-skip-reset-v6-5-5c5ba4f55039@igalia.com
Signed-off-by: Maíra Canal <mcanal@igalia.com>
|
|
Add a test to submit a single job against a scheduler with the timeout
configured and verify that if the job is still running, the timeout
handler will skip the reset and allow the job to complete.
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Reviewed-by: Philipp Stanner <phasta@kernel.org>
Link: https://lore.kernel.org/r/20250714-sched-skip-reset-v6-4-5c5ba4f55039@igalia.com
Signed-off-by: Maíra Canal <mcanal@igalia.com>
|
|
As more KUnit tests are introduced to evaluate the basic capabilities of
the `timedout_job()` hook, the test suite will continue to increase in
duration. To reduce the overall running time of the test suite, decrease
the scheduler's timeout for the timeout tests.
Before this commit:
[15:42:26] Elapsed time: 15.637s total, 0.002s configuring, 10.387s building, 5.229s running
After this commit:
[15:45:26] Elapsed time: 9.263s total, 0.002s configuring, 5.168s building, 4.037s running
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Acked-by: Philipp Stanner <phasta@kernel.org>
Link: https://lore.kernel.org/r/20250714-sched-skip-reset-v6-3-5c5ba4f55039@igalia.com
Signed-off-by: Maíra Canal <mcanal@igalia.com>
|
|
When the DRM scheduler times out, it's possible that the GPU isn't hung;
instead, a job just took unusually long (longer than the timeout) but is
still running, and there is, thus, no reason to reset the hardware. This
can occur in two scenarios:
1. The job is taking longer than the timeout, but the driver determined
through a GPU-specific mechanism that the hardware is still making
progress. Hence, the driver would like the scheduler to skip the
timeout and treat the job as still pending from then onward. This
happens in v3d, Etnaviv, and Xe.
2. Timeout has fired before the free-job worker. Consequently, the
scheduler calls `sched->ops->timedout_job()` for a job that isn't
timed out.
These two scenarios are problematic because the job was removed from the
`sched->pending_list` before calling `sched->ops->timedout_job()`, which
means that when the job finishes, it won't be freed by the scheduler
though `sched->ops->free_job()` - leading to a memory leak.
To solve these problems, create a new `drm_gpu_sched_stat`, called
DRM_GPU_SCHED_STAT_NO_HANG, which allows a driver to skip the reset. The
new status will indicate that the job must be reinserted into
`sched->pending_list`, and the hardware / driver will still complete that
job.
Reviewed-by: Philipp Stanner <phasta@kernel.org>
Link: https://lore.kernel.org/r/20250714-sched-skip-reset-v6-2-5c5ba4f55039@igalia.com
Signed-off-by: Maíra Canal <mcanal@igalia.com>
|
|
Among the scheduler's statuses, the only one that indicates an error is
DRM_GPU_SCHED_STAT_ENODEV. Any status other than DRM_GPU_SCHED_STAT_ENODEV
signifies that the operation succeeded and the GPU is in a nominal state.
However, to provide more information about the GPU's status, it is needed
to convey more information than just "OK".
Therefore, rename DRM_GPU_SCHED_STAT_NOMINAL to
DRM_GPU_SCHED_STAT_RESET, which better communicates the meaning of this
status. The status DRM_GPU_SCHED_STAT_RESET indicates that the GPU has
hung, but it has been successfully reset and is now in a nominal state
again.
Reviewed-by: Philipp Stanner <phasta@kernel.org>
Link: https://lore.kernel.org/r/20250714-sched-skip-reset-v6-1-5c5ba4f55039@igalia.com
Signed-off-by: Maíra Canal <mcanal@igalia.com>
|
|
Once we finish populating all leaf pages in the VF's LMTT we should
make sure that hardware will not access any stale data. Explicitly
force LMTT invalidation (as it was already planned in the past).
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Cc: Piotr Piórkowski <piotr.piorkowski@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://lore.kernel.org/r/20250711193316.1920-7-michal.wajdeczko@intel.com
|
|
Invalidate LMTT immediately after removing VF's LMTT page tables
and clearing root PTE in the LMTT PD to avoid any invalid access
by the hardware (and VF) due to stale data.
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Cc: Piotr Piórkowski <piotr.piorkowski@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250711193316.1920-6-michal.wajdeczko@intel.com
|
|
By default the GuC starts in the 'native' mode and enables the VGT
mode (aka 'virtualization' mode) only after it receives at least one
set of VF configuration data. While this happens naturally while PF
begins VFs provisioning, we might need this sooner as some actions,
like TLB_INVALIDATION_ALL(0x7002), is supported by the GuC only in
the VGT mode.
And this becomes a real problem if we would want to use above action
to invalidate the LMTT early during VFs auto-provisioning, before VFs
are enabled, as such H2G would be rejected:
[ ] xe 0000:4d:00.0: [drm] *ERROR* GT0: FAST_REQ H2G fence 0x804e failed! e=0x30, h=0
[ ] xe 0000:4d:00.0: [drm] *ERROR* GT0: Fence 0x804e was used by action 0x7002 sent at:
h2g_write+0x33e/0x870 [xe]
__guc_ct_send_locked+0x1e1/0x1110 [xe]
guc_ct_send_locked+0x9f/0x740 [xe]
xe_guc_ct_send_locked+0x19/0x60 [xe]
send_tlb_invalidation+0xc2/0x470 [xe]
xe_gt_tlb_invalidation_all_async+0x45/0xa0 [xe]
xe_gt_tlb_invalidation_all+0x4b/0xa0 [xe]
lmtt_invalidate_hw+0x64/0x1a0 [xe]
xe_lmtt_invalidate_hw+0x5c/0x340 [xe]
pf_update_vf_lmtt+0x398/0xae0 [xe]
pf_provision_vf_lmem+0x350/0xa60 [xe]
xe_gt_sriov_pf_config_bulk_set_lmem+0xe2/0x410 [xe]
xe_gt_sriov_pf_config_set_fair_lmem+0x1c6/0x620 [xe]
xe_gt_sriov_pf_config_set_fair+0xd5/0x3f0 [xe]
xe_pci_sriov_configure+0x360/0x1200 [xe]
sriov_numvfs_store+0xbc/0x1d0
dev_attr_store+0x17/0x40
sysfs_kf_write+0x4a/0x80
kernfs_fop_write_iter+0x166/0x220
vfs_write+0x2ba/0x580
ksys_write+0x77/0x100
__x64_sys_write+0x19/0x30
x64_sys_call+0x2bf/0x2660
do_syscall_64+0x93/0x7a0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ ] xe 0000:4d:00.0: [drm] *ERROR* GT0: CT dequeue failed: -71
[ ] xe 0000:4d:00.0: [drm] GT0: trying reset from receive_g2h [xe]
This could be mitigated by pushing earlier a PF self-configuration
with some hard-coded values that cover unlimited access to the GGTT,
use of all GuC contexts and doorbells. This step is sufficient for
the GuC to switch into the VGT mode.
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://lore.kernel.org/r/20250711193316.1920-5-michal.wajdeczko@intel.com
|
|
In upcoming patch we will want to encode GGTT config KLVs based
on raw numbers, without relying on the allocated GGTT node.
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://lore.kernel.org/r/20250711193316.1920-4-michal.wajdeczko@intel.com
|