linux/linux-stable.git - Linux kernel stable tree

Age	Commit message (Collapse)	Author
2024-10-11	drm/xe/query: Move timestamp reg to hwe_read_timestamp()	Lucas De Marchi
	__read_timestamps() is actually reading the timestamp from a certain hwe. Use it as parameter, move register declarations to be inside that function and rename it. Reviewed-by: Sai Teja Pottumuttu <sai.teja.pottumuttu@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241011035618.1057602-2-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-10-11	drm/xe/query: Increase timestamp width	Lucas De Marchi
	Starting with Xe2 the timestamp is a full 64 bit counter, contrary to the 36 bit that was available before. Although 36 should be sufficient for any reasonable delta calculation (for Xe2, of about 30min), it's surprising to userspace to get something truncated. Also if the timestamp being compared to is coming from the GPU and the application is not careful enough to apply the width there, a delta calculation would be wrong. Extend it to full 64-bits starting with Xe2. v2: Expand width=64 to media gt, as it's just a wrong tagging in the spec - empirical tests show it goes beyond 36 bits and match the engines for the main gt Bspec: 60411 Cc: Szymon Morek <szymon.morek@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241011035618.1057602-1-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-10-11	drm/xe: Use bookkeep slots for external BO's in exec IOCTL	Matthew Brost
	Fix external BO's dma-resv usage in exec IOCTL using bookkeep slots rather than write slots. This leaves syncing to user space rather than the KMD blindly enforcing write semantics on every external BO. Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Cc: José Roberto de Souza <jose.souza@intel.com> Cc: Kenneth Graunke <kenneth.w.graunke@intel.com> Cc: Paulo Zanoni <paulo.r.zanoni@intel.com> Reported-by: Simona Vetter <simona.vetter@ffwll.ch> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2673 Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: José Roberto de Souza <jose.souza@intel.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Link: https://patchwork.freedesktop.org/patch/msgid/20240911152622.903058-1-matthew.brost@intel.com
2024-10-11	drm/xe: Don't free job in TDR	Matthew Brost
	Freeing job in TDR is not safe as TDR can pass the run_job thread resulting in UAF. It is only safe for free job to naturally be called by the scheduler. Rather free job in TDR, add to pending list. Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2811 Cc: Matthew Auld <matthew.auld@intel.com> Fixes: e275d61c5f3f ("drm/xe/guc: Handle timing out of signaled jobs gracefully") Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003001657.3517883-3-matthew.brost@intel.com
2024-10-11	drm/xe: Take job list lock in xe_sched_add_pending_job	Matthew Brost
	A fragile micro optimization in xe_sched_add_pending_job relied on both the GPU scheduler being stopped and fence signaling stopped to safely add a job to the pending list without the job list lock in xe_sched_add_pending_job. Remove this optimization and just take the job list lock. Fixes: 7ddb9403dd74 ("drm/xe: Sample ctx timestamp to determine if jobs have timed out") Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003001657.3517883-2-matthew.brost@intel.com
2024-10-11	drm/xe/display: Add missing HPD interrupt enabling during non-d3cold RPM resume	Imre Deak
	Atm the display HPD interrupts that got disabled during runtime suspend, are re-enabled only if d3cold is enabled. Fix things by also re-enabling the interrupts if d3cold is disabled. Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Signed-off-by: Imre Deak <imre.deak@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241009194358.1321200-5-imre.deak@intel.com
2024-10-11	drm/xe/display: Separate the d3cold and non-d3cold runtime PM handling	Imre Deak
	For clarity separate the d3cold and non-d3cold runtime PM handling. The only change in behavior is disabling polling later during runtime resume. This shouldn't make a difference, since the poll disabling is handled from a work, which could run at any point wrt. the runtime resume handler. The work will also require a runtime PM reference, syncing it with the resume handler. Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Signed-off-by: Imre Deak <imre.deak@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241009194358.1321200-4-imre.deak@intel.com
2024-10-10	drm/xe/guc: Fix inverted logic on snapshot->copy check	Colin Ian King
	Currently the check to see if snapshot->copy has been allocated is inverted and ends up dereferencing snapshot->copy when free'ing objects in the array when it is null or not free'ing the objects when snapshot->copy is allocated. Fix this by using the correct non-null pointer check logic. Fixes: d8ce1a977226 ("drm/xe/guc: Use a two stage dump for GuC logs and add more info") Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: John Harrison <John.C.Harrison@Intel.com> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241009160510.372195-1-colin.i.king@gmail.com
2024-10-10	drm/xe: fix unbalanced rpm put() with declare_wedged()	Matthew Auld
	Technically the or_reset() means we call the action on failure, however that would lead to unbalanced rpm put(). Move the get() earlier to fix this. It should be extremely unlikely to ever trigger this in practice. Fixes: 452bca0edbd0 ("drm/xe: Don't suspend device upon wedge") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241009084808.204432-4-matthew.auld@intel.com
2024-10-10	drm/xe: fix unbalanced rpm put() with fence_fini()	Matthew Auld
	Currently we can call fence_fini() twice if something goes wrong when sending the GuC CT for the tlb request, since we signal the fence and return an error, leading to the caller also calling fini() on the error path in the case of stack version of the flow, which leads to an extra rpm put() which might later cause device to enter suspend when it shouldn't. It looks like we can just drop the fini() call since the fence signaller side will already call this for us. There are known mysterious splats with device going to sleep even with an rpm ref, and this could be one candidate. v2 (Matt B): - Prefer warning if we detect double fini() Fixes: 0a382f9bc5dc ("drm/xe: Hold a PM ref when GT TLB invalidations are inflight") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241009084808.204432-3-matthew.auld@intel.com
2024-10-09	drm/xe/xe2lpg: Extend Wa_15016589081 for xe2lpg	Aradhya Bhatia
	Add workaround (wa) 15016589081 which applies to Xe2_v3_LPG_MD. Xe2_v3_LPG_MD is a Lunar Lake platform with GFX version: 20.04. This wa is type: permanent, and hence is applicable on all steppings. Signed-off-by: Aradhya Bhatia <aradhya.bhatia@intel.com> Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241009065542.283151-1-aradhya.bhatia@intel.com
2024-10-09	drm/xe/xe3: Add initial set of workarounds	Gustavo Sousa
	Implement the initial set of workarounds for Xe3 IPs. Signed-off-by: Gustavo Sousa <gustavo.sousa@intel.com> Signed-off-by: Matt Atwood <matthew.s.atwood@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241008204626.55802-2-matthew.s.atwood@intel.com
2024-10-09	drm/xe/tests: Fix the shrinker test compiler warnings.	Thomas Hellström
	The xe_bo_shrink_kunit test has an uninitialized value and illegal integer size conversions on 32-bit. Fix. v2: - Use div64_u64 to ensure the u64 division compiles everywhere. (Matt Auld) Reported-by: Nathan Chancellor <nathan@kernel.org> Closes: https://lore.kernel.org/20240913195649.GA61514@thelio-3990X/ Fixes: 5a90b60db5e6 ("drm/xe: Add a xe_bo subtest for shrinking / swapping") Cc: dri-devel@lists.freedesktop.org Cc: Matthew Auld <matthew.auld@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> #v1 Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241004141121.186177-1-thomas.hellstrom@linux.intel.com
2024-10-09	drm/xe/bmg: improve cache flushing behaviour	Matthew Auld
	The BSpec says that EN_L3_RW_CCS_CACHE_FLUSH must be toggled on for manual global invalidation to take effect and actually flush device cache, however this also turns on flushing for things like pipecontrol, which occurs between submissions for compute/render. This sounds like massive overkill for our needs, where we already have the manual flushing on the display side with the global invalidation. Some observations on BMG: 1. Disabling l2 caching for host writes and stubbing out the driver global invalidation but keeping EN_L3_RW_CCS_CACHE_FLUSH enabled, has no impact on wb-transient-vs-display IGT, which makes sense since the pipecontrol is now flushing the device cache after the render copy. Without EN_L3_RW_CCS_CACHE_FLUSH the test then fails, which is also expected since device cache is now dirty and display engine can't see the writes. 2. Disabling EN_L3_RW_CCS_CACHE_FLUSH, but keeping the driver global invalidation also has no impact on wb-transient-vs-display. This suggests that the global invalidation still works as expected and is flushing the device cache without EN_L3_RW_CCS_CACHE_FLUSH turned on. With that drop EN_L3_RW_CCS_CACHE_FLUSH. This helps some workloads since we no longer flush the device cache between submissions as part of pipecontrol. Edit: We now also have clarification from HW side that BSpec was indeed wrong here. v2: - Rebase and update commit message. BSpec: 71718 Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Vitasta Wattal <vitasta.wattal@intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Cc: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241007074541.33937-2-matthew.auld@intel.com
2024-10-08	drm/xe/guc: Save manual engine capture into capture list	Zhanjun Dong
	Save manual engine capture into capture list. This removes duplicate register definitions across manual-capture vs guc-err-capture. Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com> Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241004193428.3311145-7-zhanjun.dong@intel.com
2024-10-08	drm/xe/guc: Plumb GuC-capture into dev coredump	Zhanjun Dong
	When we decide to kill a job, (from guc_exec_queue_timedout_job), we could end up with 4 possible scenarios at this starting point of this decision: 1. the guc-captured register-dump is already there. 2. the driver is wedged.mode > 1, so GuC-engine-reset / GuC-err-capture will not happen. 3. the user has started the driver in execlist-submission mode. 4. the guc-captured register-dump is not ready yet so we force GuC to kill that context now, but: A. we don't know yet if GuC will be successful on the engine-reset and get the guc-err-capture, else kmd will do a manual reset later OR B. guc will be successful and we will get a guc-err-capture shortly. So to accomdate the scenarios of 2 and 4A, we will need to do a manual KMD capture first(which is not be reliable in guc-submission mode) and decide later if we need to use that for the cases of 2 or 4A. So this flow is part of the implementation for this patch. Provide xe_guc_capture_get_reg_desc_list to get the register dscriptor list. Add manual capture by read from hw engine if GuC capture is not ready. If it becomes ready at later time, GuC sourced data will be used. Although there may only be a small delay between (1) the check for whether guc-err-capture is available at the start of guc_exec_queue_timedout_job and (2) the decision on using a valid guc-err-capture or manual-capture, lets not take any chances and lock the matching node down so it doesn't get re-claimed if GuC-Err-Capture subsystem is running out of pre-cached nodes. Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com> Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241004193428.3311145-6-zhanjun.dong@intel.com
2024-10-08	drm/xe/guc: Extract GuC error capture lists	Zhanjun Dong
	Upon the G2H Notify-Err-Capture event, parse through the GuC Log Buffer (error-capture-subregion) and generate one or more capture-nodes. A single node represents a single "engine- instance-capture-dump" and contains at least 3 register lists: global, engine-class and engine-instance. An internal link list is maintained to store one or more nodes. Because the link-list node generation happen before the call to devcoredump, duplicate global and engine-class register lists for each engine-instance register dump if we find dependent-engine resets in a engine-capture-group. To avoid dynamically allocate the output nodes during gt reset, pre-allocate a fixed number of empty nodes up front (at the time of ADS registration) that we can consume from or return to an internal cached list of nodes. Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com> Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241004193428.3311145-5-zhanjun.dong@intel.com
2024-10-08	drm/xe/guc: Add capture size check in GuC log buffer	Zhanjun Dong
	Capture-nodes generated by GuC are placed in the GuC capture ring buffer which is a sub-region of the larger Guc-Log-buffer. Add capture output size check before allocating the shared buffer. Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com> Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241004193428.3311145-4-zhanjun.dong@intel.com
2024-10-08	drm/xe/guc: Add XE_LP steered register lists	Zhanjun Dong
	Add the ability for runtime allocation and freeing of steered register list extentions that depend on the detected HW config fuses. Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com> Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241004193428.3311145-3-zhanjun.dong@intel.com
2024-10-08	drm/xe/guc: Prepare GuC register list and update ADS size for error capture	Zhanjun Dong
	Add referenced registers defines and list of registers. Update GuC ADS size allocation to include space for the lists of error state capture register descriptors. Then, populate GuC ADS with the lists of registers we want GuC to report back to host on engine reset events. This list should include global, engine-class and engine-instance registers for every engine-class type on the current hardware. Ensure we allocate a persistent storage for the register lists that are populated into ADS so that we don't need to allocate memory during GT resets when GuC is reloaded and ADS population happens again. Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com> Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241004193428.3311145-2-zhanjun.dong@intel.com
2024-10-08	drm/xe/xe3lpm: Add new "instance0" steering table	Matt Roper
	MCR steering on Xe3 media IP is almost the same as it was on Xe2, except for one new range (0x38D0D0 - 0x38D0FF) which has changed to an MCR "MEDIAINF" range on Xe3. Since we can always steer to grpid / instanceid 0 for MEDIAINF ranges, define a new "INSTANCE0" steering table for Xe3 media. Xe3 can continue to use the same OADDRM/GPMXMT table as Xe2. v2: Merge continuous entries 38D0D0 - 38F0FF Bspec: 74298 Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: Matt Atwood <matthew.s.atwood@intel.com> Reviewed-by: Gustavo Sousa <gustavo.sousa@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241008013509.61233-7-matthew.s.atwood@intel.com
2024-10-08	drm/xe/ptl: Add PTL platform definition	Haridhar Kalvala
	PTL is an integrated GPU based on the Xe3 architecture. v2: explicitly turn off display until display patches land. Bspec: 72574 Cc: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: Haridhar Kalvala <haridhar.kalvala@intel.com> Signed-off-by: Matt Atwood <matthew.s.atwood@intel.com> Reviewed-by: Shekhar Chauhan <shekhar.chauhan@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241008013509.61233-6-matthew.s.atwood@intel.com
2024-10-08	drm/xe/ptl: PTL re-uses Xe2 MOCS table	Haridhar Kalvala
	PTL is Xe3 architecture but there is no difference between LNL and PTL in MOCS table. So, PTL uses the same MOCS table as LNL. Bspec: 71582 Cc: Matt Roper <matthew.d.roper@intel.com> Cc: Shekhar Chauhan <shekhar.chauhan@intel.com> Signed-off-by: Haridhar Kalvala <haridhar.kalvala@intel.com> Signed-off-by: Matt Atwood <matthew.s.atwood@intel.com> Reviewed-by: Shekhar Chauhan <shekhar.chauhan@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241008013509.61233-5-matthew.s.atwood@intel.com
2024-10-08	drm/xe/xe3: Define Xe3 feature flags	Haridhar Kalvala
	Define a common set of Xe3 feature flags and definitions that will be used for all platforms in this family. The feature flags are inherited unchanged from the Xe2 (XE2_FEATURES) platform. Following B-spec details inherited from Xe2 feature flag definition commit. v2: reuse graphics_xe2 definition Bspec: 58695 - dma_mask_size remains 46 (not documented in bspec) - supports_usm=1 (Bspec 59651) - has_flatccs=1 (Bspec 58797) - has_4tile=1 (Bspec 58788) - has_asid=1 (Bspec 59654, 59265, 60288) - has_range_tlb_invalidate=1 (Bspec 71126) - five-level page table (Bspec 59505) - 1 VD + 1 VE + 1 SFC (Bspec 67103, 70819) - platform engine mask (Bspec 60149) Cc: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: Haridhar Kalvala <haridhar.kalvala@intel.com> Signed-off-by: Matt Atwood <matthew.s.atwood@intel.com> Reviewed-by: Shekhar Chauhan <shekhar.chauhan@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241008013509.61233-3-matthew.s.atwood@intel.com
2024-10-08	drm/xe/xe3: Xe3 uses the same PAT settings as Xe2	Matt Roper
	Xe3 platforms use the same PAT tables as Xe2. Bspec: 71582 Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: Matt Atwood <matthew.s.atwood@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241008013509.61233-2-matthew.s.atwood@intel.com
2024-10-08	drm/xe/ptl: L3bank mask is not available on the media GT	Shekhar Chauhan
	On PTL platforms with media version 30.00, the fuse registers for reporting L3 bank availability to the GT just read out as ~0 and do not provide proper values. Xe does not use the L3 bank mask for anything internally; it only passes the mask through to userspace via the GT topology query. Since we don't have any way to get the real L3 bank mask, we don't want to pass garbage to userspace. Passing a zeroed mask or a copy of the primary GT's L3 bank mask would also be inaccurate and likely to cause confusion for userspace. The best approach is to simply not include L3 in the list of masks returned by the topology query in cases where we aren't able to provide a meaningful value. This won't change the behavior for any existing platforms (where we can always obtain L3 masks successfully for all GTs), it will only prevent us from mis-reporting bad information on upcoming platform(s). There's a good chance this will become a formal workaround in the future, but for now we don't have a lineage number so "no_media_l3" is used in place of a lineage as the OOB workaround descriptor. v2: - Re-calculate query size to properly match data returned. (Gustavo) - Update kerneldoc to clarify that the L3bank mask may not be included in the query results if the hardware doesn't make it available. (Gustavo) Cc: Matt Atwood <matthew.s.atwood@intel.com> Cc: Gustavo Sousa <gustavo.sousa@intel.com> Signed-off-by: Shekhar Chauhan <shekhar.chauhan@intel.com> Co-developed-by: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Reviewed-by: Gustavo Sousa <gustavo.sousa@intel.com> Acked-by: Francois Dugast <francois.dugast@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241007154143.2021124-2-matthew.d.roper@intel.com
2024-10-07	drm/xe/guc: Add a helper function for dumping GuC log to dmesg	John Harrison
	Create a helper function that can be used to dump the GuC log to dmesg in a manner that is reliable for extraction and decode. The intention is that calls to this can be added by developers when debugging specific issues that require a GuC log but do not allow easy capture of the log - e.g. failures in selftests and failues that lead to kernel hangs. Also note that this is really a temporary stop-gap. The aim is to allow on demand creation and dumping of devcoredump captures (which includes the GuC log and much more). Currently this is not possible as much of the devcoredump code requires a 'struct xe_sched_job' and those are not available at many places that might want to do the dump. v2: Add kerneldoc - review feedback from Michal W. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003004611.2323493-12-John.C.Harrison@Intel.com
2024-10-07	drm/xe/guc: Add GuC log to devcoredump captures	John Harrison
	Include the GuC log in devcoredump captures because they can be useful with debugging certain types of bug. v2: Fix kerneldoc v3: Drop module parameter as now using more compact ascii85 encoding rather than hexdump (although still not compressed) (review feedback from Matthew B). Rebase onto recent refactoring of devcoredump code. v4: Don't move the submission snapshot inside the GuC internals structure 'cos it really doesn't belong there. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003004611.2323493-11-John.C.Harrison@Intel.com
2024-10-07	drm/xe/guc: Dump entire CTB on errors	John Harrison
	The dump of the CT buffers was only showing the unprocessed data which is not generally useful for saying why a hang occurred - because it was probably caused by the commands that were just processed. So save and dump the entire buffer but in a more compact dump format. Also zero fill it on allocation to avoid confusion over uninitialised data in the dump. v2: Add kerneldoc - review feedback from Michal W. v3: Fix kerneldoc. v4: Use ascii85 instead of hexdump (review feedback from Matthew B). v5: Dump the entire CTB object rather than separately dumping just the H2G and G2H sections. That way it includes the full header info. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003004611.2323493-10-John.C.Harrison@Intel.com
2024-10-07	drm/xe/guc: Dead CT helper	John Harrison
	Add a worker function helper for asynchronously dumping state when an internal/fatal error is detected in CT processing. Being asynchronous is required to avoid deadlocks and scheduling-while-atomic or process-stalled-for-too-long issues. Also check for a bunch more error conditions and improve the handling of some existing checks. v2: Use compile time CONFIG check for new (but not directly CT_DEAD related) checks and use unsigned int for a bitmask, rename CT_DEAD_RESET to CT_DEAD_REARM and add some explaining comments, rename 'hxg' macro parameter to 'ctb' - review feedback from Michal W. Drop CT_DEAD_ALIVE as no need for a bitfield define to just set the entire mask to zero. v3: Fix kerneldoc v4: Nullify some floating pointers after free. v5: Add section headings and device info to make the state dump look more like a devcoredump to allow parsing by the same tools (eventual aim is to just call the devcoredump code itself, but that currently requires an xe_sched_job, which is not available in the CT code). v6: Fix potential for leaking snapshots with concurrent error conditions (review feedback from Julia F). v7: Don't complain about unexpected G2H messages yet because there is a known issue causing them. Fix bit shift bug with v6 change. Add GT id to fake coredump headers and use puts instead of printf. v8: Disable the head mis-match check in g2h_read because it is failing on various discrete platforms due to unknown reasons. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003004611.2323493-9-John.C.Harrison@Intel.com
2024-10-07	drm/print: Introduce drm_line_printer	Michal Wajdeczko
	This drm printer wrapper can be used to increase the robustness of the captured output generated by any other drm_printer to make sure we didn't lost any intermediate lines of the output by adding line numbers to each output line. Helpful for capturing some crash data. v2: Extended short int counters to full int (JohnH) Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Cc: Jani Nikula <jani.nikula@intel.com> Cc: dri-devel@lists.freedesktop.org Reviewed-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003004611.2323493-8-John.C.Harrison@Intel.com
2024-10-07	drm/xe/guc: Use a two stage dump for GuC logs and add more info	John Harrison
	Split the GuC log dump into a two stage snapshot and print mechanism. This allows the log to be captured at the point of an error (which may be in a restricted context) and then dump it out later (from a regular context such as a worker function or a sysfs file handler). Also add a bunch of other useful pieces of information that can help (or are fundamentally required!) to decode and parse the log. v2: Add kerneldoc and fix a couple of comment typos - review feedback from Michal W. v3: Move chunking code to this patch as it makes the deltas simpler. Fix a bunch of kerneldoc issues. v4: Move the CS frequency out of the coredump snapshot function into the debugfs only code (as that info is already part of the main devcoredump). Add a header to the debugfs log to match the one in the devcoredump to aid processing by a unified tool. Add forcewake to the GuC timestamp read so it actually works. v6: Add colon to GuC version string (review feedback by Julia F). Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003004611.2323493-7-John.C.Harrison@Intel.com
2024-10-07	drm/xe/guc: Copy GuC log prior to dumping	John Harrison
	Add an extra stage to the GuC log print to copy the log buffer into regular host memory first, rather than printing the live GPU buffer object directly. Doing so helps prevent inconsistencies due to the log being updated as it is being dumped. It also allows the use of the ASCII85 helper function for printing the log in a more compact form than a straight hex dump. v2: Use %zx instead of %lx for size_t prints. v3: Replace hexdump code with ascii85 call (review feedback from Matthew B). Move chunking code into next patch as that reduces the deltas of both. v4: Add a prefix to the ASCII85 output to aid tool parsing. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003004611.2323493-6-John.C.Harrison@Intel.com
2024-10-07	drm/xe/devcoredump: Add ASCII85 dump helper function	John Harrison
	There is a need to include the GuC log and other large binary objects in core dumps and via dmesg. So add a helper for dumping to a printer function via conversion to ASCII85 encoding. Another issue with dumping such a large buffer is that it can be slow, especially if dumping to dmesg over a serial port. So add a yield to prevent the 'task has been stuck for 120s' kernel hang check feature from firing. v2: Add a prefix to the output string. Fix memory allocation bug. v3: Correct a string size calculation and clean up a define (review feedback from Julia F). Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003004611.2323493-5-John.C.Harrison@Intel.com
2024-10-07	drm/xe/devcoredump: Improve section headings and add tile info	John Harrison
	The xe_guc_exec_queue_snapshot is not really a GuC internal thing and is definitely not a GuC CT thing. So give it its own section heading. The snapshot itself is really a capture of the submission backend's internal state. Although all it currently prints out is the submission contexts. So label it as 'Contexts'. If more general state is added later then it could be change to 'Submission backend' or some such. Further, everything from the GuC CT section onwards is GT specific but there was no indication of which GT it was related to (and that is impossible to work out from the other fields that are given). So add a GT section heading. Also include the tile id of the GT, because again significant information. Lastly, drop a couple of unnecessary line feeds within sections. v2: Add GT section heading, add tile id to device section. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003004611.2323493-4-John.C.Harrison@Intel.com
2024-10-07	drm/xe/devcoredump: Use drm_puts and already cached local variables	John Harrison
	There are a bunch of calls to drm_printf with static strings. Switch them to drm_puts instead. There are also a bunch of 'coredump->snapshot.XXX' references when 'coredump->snapshot' has alread been cached locally as 'ss'. So use 'ss->XXX' instead. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003004611.2323493-3-John.C.Harrison@Intel.com
2024-10-07	drm/xe/guc: Remove spurious line feed in debug print	John Harrison
	Including line feeds at the start of a debug print messes up the output when sent to dmesg. The break appears between all the useful prefix information and the actual string being printed. In this case, each block of data has a very clear start line and an extra delimeter is really not necessary. So don't do it. v2: Fix typo in commit message (review feedback from Michal W.) Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003004611.2323493-2-John.C.Harrison@Intel.com
2024-10-07	drm/xe/pf: Allow to save and restore VF config blob from debugfs	Michal Wajdeczko
	For feature enabling and testing purposes, allow to capture and replace full VF configuration using debugfs blob file, but only under strict debug config. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Michał Winiarski <michal.winiarski@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240919171528.1451-6-michal.wajdeczko@intel.com
2024-10-07	drm/xe/pf: Add functions to save and restore VF configuration blob	Michal Wajdeczko
	We already have support to save and restore GuC VF state, but that will only work when the target VF configuration (provisioning) will be exactly the same as the source VF configuration. To help with assuring that both configurations match, allow to encode whole VF configuration that can be saved as blob and restored later. In the future we may want to use such captured configuration blobs as templates to make sure we provision VFs with exactly the same configuration that was previously tested or recommended, or when debugfs knobs are not be available. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Michał Winiarski <michal.winiarski@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240919171528.1451-5-michal.wajdeczko@intel.com
2024-10-07	drm/xe/pf: Allow to encode subset of VF configuration KLVs	Michal Wajdeczko
	We want to reuse format of the GuC KLVs while saving and restoring VF configuration by the PF driver, but some of those KLVs (like doorbell begin index or GGTT starting offset) are not strictly needed to correctly restore VF configuration. Modify functions to omit encoding of some of the KLVs with GuC only details. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Michał Winiarski <michal.winiarski@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240919171528.1451-4-michal.wajdeczko@intel.com
2024-10-07	drm/xe/pf: Update success code of pf_validate_vf_config()	Michal Wajdeczko
	This function may return negative error codes on invalid or incomplete VF configuration, but unlike other int functions, it was returning 1 instead of 0 on success, which might be little inconvinient if we would like to use it directly in other functions. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Michał Winiarski <michal.winiarski@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240919171528.1451-3-michal.wajdeczko@intel.com
2024-10-07	drm/xe/guc: Add yet another helper macro for threshold	Michal Wajdeczko
	We already have and use MAKE_GUC_KLV_VF_CFG_THRESHOLD_KEY, but in upcoming patch we want to generate the key for the length. Add MAKE_GUC_KLV_VF_CFG_THRESHOLD_LEN for this purpose. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Michał Winiarski <michal.winiarski@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240919171528.1451-2-michal.wajdeczko@intel.com
2024-10-06	drm/xe: Add memirq report page address helpers	Michal Wajdeczko
	Both xe_memirq_{source,status}_ptr functions are now strictly for obtaining an address of the memory based interrupt report pages used by the HW engines. When initializing the GuC that does not require special per instance page preparation, we don't need to abuse these public functions and pass a NULL instead of valid hwe pointer. Also, without further fixes, this actually may lead to NPD crash once the hw_reports_to_instance_zero() will be true. Add internal helpers that will provide report page addresses based solely on the instance number, which will be always 0 for both GuCs. Fixes: ef6103d20f97 ("drm/xe: memirq infra changes for MSI-X") Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Piotr Piórkowski <piotr.piorkowski@intel.com> Cc: Ilia Levi <ilia.levi@intel.com> Reviewed-by: Ilia Levi <ilia.levi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240925084445.1495-1-michal.wajdeczko@intel.com
2024-10-04	drm/xe: Make wedged_mode debugfs writable	Matt Roper
	The intent of this debugfs entry is to allow modification of wedging behavior, either from IGT tests or during manual debug; it should be marked as writable to properly reflect this. In practice this hasn't caused a problem because we always access wedged_mode as root, which ignores file permissions, but it's still misleading to have the entry incorrectly marked as RO. Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Fixes: 6b8ef44cc0a9 ("drm/xe: Introduce the wedged_mode debugfs") Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Gustavo Sousa <gustavo.sousa@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241002230620.1249258-2-matthew.d.roper@intel.com
2024-10-04	Merge drm/drm-next into drm-xe-next	Thomas Hellström
	Backmerging to resolve a conflict with core locally. Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2024-10-03	drm/xe: Restore GT freq on GSC load error	Vinay Belgaumkar
	As part of a Wa_22019338487, ensure that GT freq is restored even when GSC reload is not successful. Fixes: 3b1592fb7835 ("drm/xe/lnl: Apply Wa_22019338487") Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240925204918.1989574-1-vinay.belgaumkar@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-10-03	drm/xe: Use fault injection infrastructure to find issues at probe time	Francois Dugast
	The kernel fault injection infrastructure is used to test proper error handling during probe. The return code of the functions using ALLOW_ERROR_INJECTION() can be conditionnally modified at runtime by tuning some debugfs entries. This requires CONFIG_FUNCTION_ERROR_INJECTION (among others). One way to use fault injection at probe time by making each of those functions fail one at a time is: FAILTYPE=fail_function DEVICE="0000:00:08.0" # depends on the system ERRNO=-12 # -ENOMEM, can depend on the function echo N > /sys/kernel/debug/$FAILTYPE/task-filter echo 100 > /sys/kernel/debug/$FAILTYPE/probability echo 0 > /sys/kernel/debug/$FAILTYPE/interval echo -1 > /sys/kernel/debug/$FAILTYPE/times echo 0 > /sys/kernel/debug/$FAILTYPE/space echo 1 > /sys/kernel/debug/$FAILTYPE/verbose modprobe xe echo $DEVICE > /sys/bus/pci/drivers/xe/unbind grep -oP "^.* \[xe\]" /sys/kernel/debug/$FAILTYPE/injectable \| \ cut -d ' ' -f 1 \| while read -r FUNCTION ; do echo "Injecting fault in $FUNCTION" echo "" > /sys/kernel/debug/$FAILTYPE/inject echo $FUNCTION > /sys/kernel/debug/$FAILTYPE/inject printf %#x $ERRNO > /sys/kernel/debug/$FAILTYPE/$FUNCTION/retval echo $DEVICE > /sys/bus/pci/drivers/xe/bind done rmmod xe It will also be integrated into IGT for systematic execution by CI. v2: Wrappers are not needed in the cases covered by this patch, so remove them and use ALLOW_ERROR_INJECTION() directly. v3: Document the use of fault injection at probe time in xe_pci_probe and refer to it where ALLOW_ERROR_INJECTION() is used. Signed-off-by: Francois Dugast <francois.dugast@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Jani Nikula <jani.nikula@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240927151207.399354-1-francois.dugast@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-10-03	drm/xe/ct: drop irq usage of xa_erase()	Matthew Auld
	Unclear why disabling interrupts is needed here. Nothing seems to be touching fence_lookup and its corresponding lock from an irq so there should be no risk of deadlock. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Badal Nilawar <badal.nilawar@intel.com> Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241001084346.98516-8-matthew.auld@intel.com
2024-10-03	drm/xe/guc_submit: fix xa_store() error checking	Matthew Auld
	Looks like we are meant to use xa_err() to extract the error encoded in the ptr. Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Badal Nilawar <badal.nilawar@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241001084346.98516-7-matthew.auld@intel.com
2024-10-03	drm/xe/ct: fix xa_store() error checking	Matthew Auld
	Looks like we are meant to use xa_err() to extract the error encoded in the ptr. Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Badal Nilawar <badal.nilawar@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241001084346.98516-6-matthew.auld@intel.com