Age | Commit message (Collapse) | Author |
|
when an undefined opcode error occurres, the driver collects
the relevant information from the Qman and stores it inside
the hdev data structure. An event fd indication is sent towards the
user space.
Note: another commit shall be followed which will add support to
read the error info by an ioctl.
Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Due to code changes in the past few years, the original comment of
how parser->user_cb_size is checked was not correct anymore.
Fix it to reflect current code and add more explanation as the code
is more complex now.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
positive flags naming will make more clear code while adding
more 'error info' structures
Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
raising the tpc assert event in an internal function will make
the code cleaner as we are going to be adding more events
Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Arrays of struct attribute are expected to be NULL terminated.
This is required by API methods such as device_add_groups.
This fixes a crash when loading the driver for Goya device.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Create separate info structure for each error type.
The structures shall be used inside the large structure that contains
the last session error.
This is more scalable for adding more errors in the future.
Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
When user requests to prefetch the MMU translations, the driver will
not block the user until prefetch is done.
Instead, the prefetch work will be delegated to a WQ which will do it
in the background.
This way, the prefetch may progress without blocking the user at all.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
The driver will be able to send notification events towards
a user process, using user's registered event file descriptor.
The driver uses the notification mechanism to inform the
user about an occurred event.
A user thread can wait until a notification is received from
the driver.
The driver stores the occurred event until the user reads it,
using HL_INFO_GET_EVENTS - new ioctl opcode in the INFO ioctl.
Gaudi specific implementation includes sending a notification
on a TPC assertion event that is received from f/w.
Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Add the ability to scrub the device memory with a given value.
Add file 'dram_mem_scrub_val' to set the value
and a file 'dram_mem_scrub' to scrub the dram.
This is very important to help during automated tests, when you want
the CI system to randomize the memory before training certain
DL topologies.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
With the new code required for the flow added, we can now switch
to using the new memory manager infrastructure, removing the old code.
Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
In certain workloads, arbitration timeout might expire although
no actual issue present. Hence, we set timeout to a very high value.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Instead of using for_each_sg when iterating sgt that contains dma
entries, use the more proper for_each_sgtable_dma_sg macro.
In addition, both Goya and Gaudi have the exact same implementation
of the asic function that encapsulate the usage of this macro, so
it is better to move that implementation to the common code.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Use standard kernel macro to take lower 32 bits of 64-bits variable.
Signed-off-by: Rajaravi Krishna Katta <rkatta@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Currently we have two reset prints per reset. One is in the common
code and one in each asic-specific file.
We can change the asic-specific message to be debug only as we can
know the type of reset being done according to the print in the
common code, which is also easier to maintain.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Halting compute engines is a print that doesn't add us any information
because it is always done in the reset process and not used elsewhere.
Even if it was, we don't use prints to mark functions we passed
through.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
The debugfs memory access now uses the callback 'access_dev_mem'
so there is no use of the callbacks
'debugfs_{read32,read64,write32,write6}'. Remove them.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
This is a preparation for unifying the code of accessing device memory
through debugfs. Add struct fields and callbacks that will later
be used in debugfs code and will reduce code duplication
among the different read{32,64}/write{32,64} callbacks of
every asic.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
When Gaudi device is secured the monitors data in the configuration
space is blocked from PCI access.
As we need to enable user to get sync-manager monitors registers when
debugging, this patch adds a debugfs that dumps the information to a
binary file (blob).
When a root user will trigger the dump, the driver will send request to
the f/w to fill a data structure containing dump of all monitors
registers.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
We currently allow accessing the whole SRAM bar size with
the macro SRAM_BAR_SIZE, but the actual size of the sram
region is the macro SRAM_SIZE which is only a portion of
the whole bar size. So when accessing the sram through
debugfs, use the macro SRAM_SIZE for the sram size
which is the correct macro.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
This is necessary pre-requisite for future ASIC support, where MMU
TLB prefetch is supported.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
The required DMA mask is no longer based on input from the F/W, but it
is fixed per ASIC according to its address space.
As such, the per-ASIC function to get this value can be replaced with a
property variable.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
The default max power is deduced from the card type value in the CPU-CP
info. This value is then set in the max power variable of the device
structure.
Getting the CPU-CP info is done as part of the late init phase
which is called also during reset. This means that a max power value
which is modified via sysfs will be reset during hard reset back to the
default value.
As the max power is updated in any case during device init in
hl_sysfs_init(), this setting in late init can be removed, and the
overriding during reset is thus avoided.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Future devices will support multiple device memory page sizes.
In addition, an API for the user was added for it to be able to control
the device memory allocation page size.
This patch is a complementary patch to inform the user of the available
page size supported by the device.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
There is no need to hold each MMU mask/shift as a denoted structure
member (e.g. hop0_mask).
Instead converting it to array will result in smaller and more readable
code.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
This patch breaks the cumbersome implementation of "get real page size"
along with it's multiple inner conditions and implement each case
(according to the real complexity) inside an ASIC function.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Looking forward we will need to report to the user what is the default
page size used.
This will be done more conveniently by explicitly updating the property
rather than to rely on a "0 meaning default" value.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Several H/W events can be sent adjacently, even due to a single error.
If a hard-reset is triggered as part of handling one of these events,
the following events won't be handled.
The debug info from these missed events is important, sometimes even
more important than the one that was handled.
To allow handling these close events, add an option to delay a device
reset and use it when resetting due to H/W events.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
There are a few events that can arrive from the f/w and without proper
handling can cause errors to appear in the kernel log without reason.
Add the relevant handling that was missing.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Various AXI errors can occur in the NIC engines and are reported to
the driver by the f/w. Add code to print the errors and ack them to
the f/w.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
In future ASICs the MMU will be able to work with multiple page sizes,
thus a new flag is added to allow the user to set the requested page
size.
This flag is added since the whole DRAM is allocated for the user and
the user also should be familiar with the memory usage use case.
As such, the user may choose to "over allocate" memory in favor of
performance (for instance- large page allocations covers more memory
in less TLB entries).
For example: say available page sizes are of 1MB and 32MB. If user
wants to allocate 40MB the user can either set page size to 1MB and
allocate the exact amount of memory (but will result in 40 TLB entries)
or the user can use 32MB pages, "waste" 8MB of physical memory but
occupy only 2 TLB entries.
Note that this feature will be available only to ASIC that supports
multiple DRAM page sizes.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
For current devices there is a need to send the max power value to F/W
during device init, for example because there might be several card
types.
In future devices, this info will be programmed in the device's EEPROM
and will be read by F/W, and hence the driver should not send it.
Modify the sending of the relevant message to be done only for ASIC
types that need it.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
On Goya and Gaudi, the stop-on-error configuration can be set via
debugfs. However, in future devices, this configuration will always be
enabled.
Modify the debugfs node to be allowed only for ASICs that support this
dynamic configuration.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
In order to support several device MMU blocks with different
architectures (e.g. different HOP table size) we need to move to
per-MMU properties rather than keeping those properties as ASIC
properties.
Refactoring the code to use "per-MMU proprties" is a major effort.
To start making the transition towards this goal but still support
taking the properties from ASIC properties (for code that currently
uses them) this patch copies some of the properties to the "per-MMU"
properties and later, when implementing the per-MMU properties, we
would be able to delete the MMU props from the ASIC props.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
We have a common function that wraps the call to the MMU cache
invalidation function, which is ASIC-specific. The wrapper checks
the return value and prints error if necessary. For consistency, try
to use the wrapper when possible.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
We don't need this workaround anymore.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
infineon version is only applicable to GOYA and GAUDI. For later
ASICs, we display the Voltage Regulator Monitor f/w version.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Setting PLL profile is the same for all ASICs, except for GOYA.
However, because this function is never called from common code, there
is no need to have an asic-specific callback function.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
The two remaining functions in this file belong to firmware_if.c,
as they communicate with the firmware.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Retrieving the clock from the f/w is done exactly the same in ALL our
ASICs. Therefore, no real justification for doing it as an
ASIC-specific function.
The only thing is we need to check if we are running on simulator,
which doesn't require ASIC-specific callback.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Move common sysfs store/show functions to sysfs.c file for
consistency.
This is part of a patch-set to remove hwmgr.c
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Now that clock gating is permanently disabled in GAUDI, no need for
the ASIC functions of setting and disabling clock gating, as this
was a unique scenario in GAUDI.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Due to the need of SynapseAI to configure all TPC engines from a single
QMAN, the driver must disable CGM and never allow the user to enable
it. Otherwise, the configuration of the TPC engines will fail.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Unify variables related to device reset, which will help us to
add some new reset functionality in future patches.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
During the MMU development the MMU header files were left with unclean
definitions:
- MMU "version specific" definitions that were left in the mmu_general
file
- unused definitions
This patch attempts, where possible, to keep definitions that can serve
multiple MMU versions (but that are not tightly bound with specific MMU
arch) in the mmu_general header file (e.g. different definitions for
number of HOPs).
Otherwise, move MMU version specific definitions (e.g. HOPs masks and
shifts) to the specific MMU version file.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
GAUDI supports only hard-reset. Therefore, this function should
return an error of operation not permitted.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
The ASIC-specific soft_reset_late_init() is now called after either
soft-reset or reset-upon-device-release. Therefore, it needs a more
appropriate name.
No need to split it to two functions, as an ASIC either supports
soft-reset or reset-upon-device-release.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Changing the frequency automatically is only done in Goya. In future
ASICs this is done inside the firmware. Therefore, move the common code
into the Goya specific files.
Main changes as part of the commit are:
1. The thread for setting frequency is moved from device_late_init
to goya_late_init
2. hl_device_set_frequency is removed from hl_device_open as it is
not relevant for other ASICs and for Goya it is taken care by
the thread
3. hl_device_set_frequency is renamed as goya_set_frequency
Signed-off-by: Rajaravi Krishna Katta <rkatta@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Currently there is a deadlock in driver in scenarios where MMU
cache invalidation fails. The issue is basically device reset
being performed without releasing the MMU mutex.
The solution is to skip device reset as it is not necessary.
In addition we introduce a slight code refactor that prints the
invalidation error from a single location.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Getting the used PLL index with which to send the CPUPU packet relies on
the CPUCP info packet.
In case CPU queues are not enabled getting the PLL index will issue an
error and in some ASICs will also fail the driver load.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
A new uAPI is added for debug purposes of the user-space to retrieve
errors related data from previous session (before device reset was
performed).
Inforamtion is filled when a razwi or CS timeout happens and can
contain one of the following:
1. Retrieve timestamp of last time the device was opened and razwi or
CS timeout happened.
2. Retrieve information about last CS timeout.
3. Retrieve information about last razwi error.
This information doesn't contain user data, so no danger of data
leakage between users.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|