linux/linux-stable.git - Linux kernel stable tree

Age	Commit message (Collapse)	Author
2022-07-12	habanalabs/gaudi: collect undefined opcode error info	Tal Cohen
	when an undefined opcode error occurres, the driver collects the relevant information from the Qman and stores it inside the hdev data structure. An event fd indication is sent towards the user space. Note: another commit shall be followed which will add support to read the error info by an ioctl. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12	habanalabs/gaudi: fix comment to reflect current code	Oded Gabbay
	Due to code changes in the past few years, the original comment of how parser->user_cb_size is checked was not correct anymore. Fix it to reflect current code and add more explanation as the code is more complex now. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12	habanalabs: change the write flag name of error info structs	Tal Cohen
	positive flags naming will make more clear code while adding more 'error info' structures Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12	habanalabs/gaudi: move tpc assert raise into internal func	Tal Cohen
	raising the tpc assert event in an internal function will make the code cleaner as we are going to be adding more events Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12	habanalabs: add terminating NULL to attrs arrays	Dafna Hirschfeld
	Arrays of struct attribute are expected to be NULL terminated. This is required by API methods such as device_add_groups. This fixes a crash when loading the driver for Goya device. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-05-22	habanalabs: use separate structure info for each error collect data	Tal Cohen
	Create separate info structure for each error type. The structures shall be used inside the large structure that contains the last session error. This is more scalable for adding more errors in the future. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22	habanalabs: do MMU prefetch as deferred work	Ohad Sharabi
	When user requests to prefetch the MMU translations, the driver will not block the user until prefetch is done. Instead, the prefetch work will be delegated to a WQ which will do it in the background. This way, the prefetch may progress without blocking the user at all. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22	habanalabs: add support for notification via eventfd	Tal Cohen
	The driver will be able to send notification events towards a user process, using user's registered event file descriptor. The driver uses the notification mechanism to inform the user about an occurred event. A user thread can wait until a notification is received from the driver. The driver stores the occurred event until the user reads it, using HL_INFO_GET_EVENTS - new ioctl opcode in the INFO ioctl. Gaudi specific implementation includes sending a notification on a TPC assertion event that is received from f/w. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22	habanalabs: add device memory scrub ability through debugfs	Dafna Hirschfeld
	Add the ability to scrub the device memory with a given value. Add file 'dram_mem_scrub_val' to set the value and a file 'dram_mem_scrub' to scrub the dram. This is very important to help during automated tests, when you want the CI system to randomize the memory before training certain DL topologies. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22	habanalabs: use unified memory manager for CB flow	Yuri Nudelman
	With the new code required for the flow added, we can now switch to using the new memory manager infrastructure, removing the old code. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22	habanalabs/gaudi: set arbitration timeout to a high value	Ofir Bitton
	In certain workloads, arbitration timeout might expire although no actual issue present. Hence, we set timeout to a very high value. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22	habanalabs: use for_each_sgtable_dma_sg for dma sgt	Ohad Sharabi
	Instead of using for_each_sg when iterating sgt that contains dma entries, use the more proper for_each_sgtable_dma_sg macro. In addition, both Goya and Gaudi have the exact same implementation of the asic function that encapsulate the usage of this macro, so it is better to move that implementation to the common code. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22	habanalabs/gaudi: use lower_32_bits() for casting	Rajaravi Krishna Katta
	Use standard kernel macro to take lower 32 bits of 64-bits variable. Signed-off-by: Rajaravi Krishna Katta <rkatta@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22	habanalabs: change a reset print to debug level	Oded Gabbay
	Currently we have two reset prints per reset. One is in the common code and one in each asic-specific file. We can change the asic-specific message to be debug only as we can know the type of reset being done according to the print in the common code, which is also easier to maintain. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22	habanalabs: remove redundant info print	Oded Gabbay
	Halting compute engines is a print that doesn't add us any information because it is always done in the reset process and not used elsewhere. Even if it was, we don't use prints to mark functions we passed through. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22	habanalabs: remove debugfs read/write callbacks	Dafna Hirschfeld
	The debugfs memory access now uses the callback 'access_dev_mem' so there is no use of the callbacks 'debugfs_{read32,read64,write32,write6}'. Remove them. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22	habanalabs: add callback and field to be used for debugfs refactor	Dafna Hirschfeld
	This is a preparation for unifying the code of accessing device memory through debugfs. Add struct fields and callbacks that will later be used in debugfs code and will reduce code duplication among the different read{32,64}/write{32,64} callbacks of every asic. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22	habanalabs/gaudi: add debugfs to fetch internal sync status	Ohad Sharabi
	When Gaudi device is secured the monitors data in the configuration space is blocked from PCI access. As we need to enable user to get sync-manager monitors registers when debugging, this patch adds a debugfs that dumps the information to a binary file (blob). When a root user will trigger the dump, the driver will send request to the f/w to fill a data structure containing dump of all monitors registers. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22	habanalabs/gaudi: Use correct sram size macro for debugfs	Dafna Hirschfeld
	We currently allow accessing the whole SRAM bar size with the macro SRAM_BAR_SIZE, but the actual size of the sram region is the macro SRAM_SIZE which is only a portion of the whole bar size. So when accessing the sram through debugfs, use the macro SRAM_SIZE for the sram size which is the correct macro. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22	habanalabs: add MMU prefetch to ASIC-specific code	Ohad Sharabi
	This is necessary pre-requisite for future ASIC support, where MMU TLB prefetch is supported. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22	habanalabs: modify dma_mask to be ASIC specific property	Tomer Tayar
	The required DMA mask is no longer based on input from the F/W, but it is fixed per ASIC according to its address space. As such, the per-ASIC function to get this value can be replaced with a property variable. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22	habanalabs/gaudi: avoid resetting max power in hard reset	Tomer Tayar
	The default max power is deduced from the card type value in the CPU-CP info. This value is then set in the max power variable of the device structure. Getting the CPU-CP info is done as part of the late init phase which is called also during reset. This means that a max power value which is modified via sysfs will be reset during hard reset back to the default value. As the max power is updated in any case during device init in hl_sysfs_init(), this setting in late init can be removed, and the overriding during reset is thus avoided. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22	habanalabs: add user API to get valid DRAM page sizes	Ohad Sharabi
	Future devices will support multiple device memory page sizes. In addition, an API for the user was added for it to be able to control the device memory allocation page size. This patch is a complementary patch to inform the user of the available page size supported by the device. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22	habanalabs: convert all MMU masks/shifts to arrays	Ohad Sharabi
	There is no need to hold each MMU mask/shift as a denoted structure member (e.g. hop0_mask). Instead converting it to array will result in smaller and more readable code. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22	habanalabs: change mmu_get_real_page_size to be ASIC-specific	Ohad Sharabi
	This patch breaks the cumbersome implementation of "get real page size" along with it's multiple inner conditions and implement each case (according to the real complexity) inside an ASIC function. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22	habanalabs: set non-0 value in dram default page size	Ohad Sharabi
	Looking forward we will need to report to the user what is the default page size used. This will be done more conveniently by explicitly updating the property rather than to rely on a "0 meaning default" value. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-28	habanalabs: add an option to delay a device reset	Tomer Tayar
	Several H/W events can be sent adjacently, even due to a single error. If a hard-reset is triggered as part of handling one of these events, the following events won't be handled. The debug info from these missed events is important, sometimes even more important than the one that was handled. To allow handling these close events, add an option to delay a device reset and use it when resetting due to H/W events. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28	habanalabs/gaudi: add missing handling of NIC related events	Oded Gabbay
	There are a few events that can arrive from the f/w and without proper handling can cause errors to appear in the kernel log without reason. Add the relevant handling that was missing. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28	habanalabs/gaudi: handle axi errors from NIC engines	Oded Gabbay
	Various AXI errors can occur in the NIC engines and are reported to the driver by the f/w. Add code to print the errors and ack them to the f/w. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28	habanalabs: allow user to set allocation page size	Ohad Sharabi
	In future ASICs the MMU will be able to work with multiple page sizes, thus a new flag is added to allow the user to set the requested page size. This flag is added since the whole DRAM is allocated for the user and the user also should be familiar with the memory usage use case. As such, the user may choose to "over allocate" memory in favor of performance (for instance- large page allocations covers more memory in less TLB entries). For example: say available page sizes are of 1MB and 32MB. If user wants to allocate 40MB the user can either set page size to 1MB and allocate the exact amount of memory (but will result in 40 TLB entries) or the user can use 32MB pages, "waste" 8MB of physical memory but occupy only 2 TLB entries. Note that this feature will be available only to ASIC that supports multiple DRAM page sizes. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28	habanalabs: set max power on device init per ASIC	Tomer Tayar
	For current devices there is a need to send the max power value to F/W during device init, for example because there might be several card types. In future devices, this info will be programmed in the device's EEPROM and will be read by F/W, and hence the driver should not send it. Modify the sending of the relevant message to be done only for ASIC types that need it. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28	habanalabs: enable stop-on-error debugfs setting per ASIC	Tomer Tayar
	On Goya and Gaudi, the stop-on-error configuration can be set via debugfs. However, in future devices, this configuration will always be enabled. Modify the debugfs node to be allowed only for ASICs that support this dynamic configuration. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28	habanalabs: duplicate HOP table props to MMU props	Ohad Sharabi
	In order to support several device MMU blocks with different architectures (e.g. different HOP table size) we need to move to per-MMU properties rather than keeping those properties as ASIC properties. Refactoring the code to use "per-MMU proprties" is a major effort. To start making the transition towards this goal but still support taking the properties from ASIC properties (for code that currently uses them) this patch copies some of the properties to the "per-MMU" properties and later, when implementing the per-MMU properties, we would be able to delete the MMU props from the ASIC props. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28	habanalabs: use common wrapper for MMU cache invalidation	Oded Gabbay
	We have a common function that wraps the call to the MMU cache invalidation function, which is ASIC-specific. The wrapper checks the return value and prints error if necessary. For consistency, try to use the wrapper when possible. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28	habanalabs: remove power9 workaround for dma support	Oded Gabbay
	We don't need this workaround anymore. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28	habanalabs: add vrm version to sysfs	Oded Gabbay
	infineon version is only applicable to GOYA and GAUDI. For later ASICs, we display the Voltage Regulator Monitor f/w version. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28	habanalabs: remove asic callback set_pll_profile()	Oded Gabbay
	Setting PLL profile is the same for all ASICs, except for GOYA. However, because this function is never called from common code, there is no need to have an asic-specific callback function. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28	habanalabs: remove hwmgr.c	Oded Gabbay
	The two remaining functions in this file belong to firmware_if.c, as they communicate with the firmware. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28	habanalabs: get clk is common function	Oded Gabbay
	Retrieving the clock from the f/w is done exactly the same in ALL our ASICs. Therefore, no real justification for doing it as an ASIC-specific function. The only thing is we need to check if we are running on simulator, which doesn't require ASIC-specific callback. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28	habanalabs: sysfs functions should be in sysfs.c	Oded Gabbay
	Move common sysfs store/show functions to sysfs.c file for consistency. This is part of a patch-set to remove hwmgr.c Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28	habanalabs: remove ASIC functions of clock gating	Oded Gabbay
	Now that clock gating is permanently disabled in GAUDI, no need for the ASIC functions of setting and disabling clock gating, as this was a unique scenario in GAUDI. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28	habanalabs/gaudi: disable CGM permanently	Oded Gabbay
	Due to the need of SynapseAI to configure all TPC engines from a single QMAN, the driver must disable CGM and never allow the user to enable it. Otherwise, the configuration of the TPC engines will fail. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26	habanalabs: refactor reset information variables	Ofir Bitton
	Unify variables related to device reset, which will help us to add some new reset functionality in future patches. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26	habanalabs: clean MMU headers definitions	Ohad Sharabi
	During the MMU development the MMU header files were left with unclean definitions: - MMU "version specific" definitions that were left in the mmu_general file - unused definitions This patch attempts, where possible, to keep definitions that can serve multiple MMU versions (but that are not tightly bound with specific MMU arch) in the mmu_general header file (e.g. different definitions for number of HOPs). Otherwise, move MMU version specific definitions (e.g. HOPs masks and shifts) to the specific MMU version file. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26	habanalabs/gaudi: return EPERM on non hard-reset	Oded Gabbay
	GAUDI supports only hard-reset. Therefore, this function should return an error of operation not permitted. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26	habanalabs: rename late init after reset function	Oded Gabbay
	The ASIC-specific soft_reset_late_init() is now called after either soft-reset or reset-upon-device-release. Therefore, it needs a more appropriate name. No need to split it to two functions, as an ASIC either supports soft-reset or reset-upon-device-release. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26	habanalabs: Move frequency change thread to goya_late_init	Rajaravi Krishna Katta
	Changing the frequency automatically is only done in Goya. In future ASICs this is done inside the firmware. Therefore, move the common code into the Goya specific files. Main changes as part of the commit are: 1. The thread for setting frequency is moved from device_late_init to goya_late_init 2. hl_device_set_frequency is removed from hl_device_open as it is not relevant for other ASICs and for Goya it is taken care by the thread 3. hl_device_set_frequency is renamed as goya_set_frequency Signed-off-by: Rajaravi Krishna Katta <rkatta@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26	habanalabs: fix possible deadlock in cache invl failure	Ofir Bitton
	Currently there is a deadlock in driver in scenarios where MMU cache invalidation fails. The issue is basically device reset being performed without releasing the MMU mutex. The solution is to skip device reset as it is not necessary. In addition we introduce a slight code refactor that prints the invalidation error from a single location. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26	habanalabs: skip PLL freq fetch	Ohad Sharabi
	Getting the used PLL index with which to send the CPUPU packet relies on the CPUCP info packet. In case CPU queues are not enabled getting the PLL index will issue an error and in some ASICs will also fail the driver load. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26	habanalabs: add support for fetching historic errors	Dani Liberman
	A new uAPI is added for debug purposes of the user-space to retrieve errors related data from previous session (before device reset was performed). Inforamtion is filled when a razwi or CS timeout happens and can contain one of the following: 1. Retrieve timestamp of last time the device was opened and razwi or CS timeout happened. 2. Retrieve information about last CS timeout. 3. Retrieve information about last razwi error. This information doesn't contain user data, so no danger of data leakage between users. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>