Age | Commit message | Author |
|
Remove the optimized strcpy() library implementation. This doesn't make any
difference since gcc recognizes all strcpy() usages anyway and uses the
builtin variant. There is not a single branch to strcpy() within the
generated kernel image, which also seems to be the reason why most other
architectures don't have a strcpy() implementation anymore.
Reviewed-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Use strscpy() instead of strcpy() so that bounds checking is performed
on the destination buffer. This requires keeping track of the size of
the dynamically allocated prompt memory area, which is done with a new
prompt_sz member within struct tty3270.
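A minimal sketch of the resulting pattern (all names other than strscpy(),
prompt_sz and struct tty3270 are illustrative, not the actual code):

struct tty3270 {
        /* ... */
        char *prompt;           /* dynamically allocated prompt buffer */
        size_t prompt_sz;       /* size of the prompt buffer */
};

/* bounds-checked copy instead of strcpy(tp->prompt, str) */
strscpy(tp->prompt, str, tp->prompt_sz);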
Reviewed-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Convert all strcpy() usages to strscpy(). strcpy() is deprecated since
it performs no bounds checking on the destination buffer.
Reviewed-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Convert all strcpy() usages to strscpy() where the conversion means
just replacing strcpy() with strscpy(). strcpy() is deprecated since
it performs no bounds checking on the destination buffer.
Reviewed-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Harald Freudenberger says:
====================
This series of patches has the goal to open up a do-not-allocate
memory path from the callers of the pkey in-kernel api down to
the crypto cards and back.
The asynchronous in-kernel cipher implementations (and the s390 PAES
cipher implementations are among them) may be called in a
context where memory allocations which trigger IO are not acceptable.
So this patch series reworks the AP bus code, the zcrypt layer,
the pkey layer and the pkey handlers to respect this situation
by processing a new parameter xflags (execution hints flags).
There is a flag PKEY_XFLAG_NOMEMALLOC which tells the code to
not allocate memory which may lead to IO operations.
To reach this goal, different approaches were taken in the actual code changes:
The zcrypt misc functions which need memory for cprb build
use a pre-allocated memory pool for this purpose. The findcard()
functions have one temp memory area preallocated and protected
with a mutex. Some smaller data is no longer allocated dynamically but
placed on the stack instead. The AP bus also uses a pre-allocated
memory pool for building AP message requests.
Note that the PAES implementation still needs to get reworked
to run the protected key derivation in a real asynchronous way.
However, this rework of AP bus, zcrypt and pkey is the base work
required before reconsidering the PAES implementation.
The patch series starts at the bottom (AP bus) and goes up the call
chain (PKEY). At any time in the patch stack it should compile.
For easier review I tried to have one logical code change per
patch and thus keep the patches "small".
====================
Link: https://lore.kernel.org/r/20250424133619.16495-1-freude@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Add a new parameter xflags to the in-kernel API function
pkey_key2protkey(). Currently there is only one flag supported:
* PKEY_XFLAG_NOMEMALLOC:
If this flag is given in the xflags parameter, the pkey
implementation is not allowed to allocate memory but instead should
fall back to using preallocated memory or simply fail with -ENOMEM.
This flag is meant for protected key derivation within a cipher or similar
which must not allocate memory that would cause IO operations - see
also the CRYPTO_ALG_ALLOCATES_MEMORY flag in crypto.h.
The one and only user of this in-kernel API - the PAES skcipher
implementations in paes_s390.c - sets this flag upon request
to derive a protected key from the given raw key material.
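A hedged sketch of how such a caller might pass the flag; the prototype
shown here is assumed from the description above, not quoted from the
actual header:

/* derive a protected key without allocations that may trigger IO */
rc = pkey_key2protkey(key, keylen, protkey, &protkeylen, &protkeytype,
                      PKEY_XFLAG_NOMEMALLOC);
if (rc == -ENOMEM)
        return rc;      /* preallocated memory exhausted, caller handles it */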
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Link: https://lore.kernel.org/r/20250424133619.16495-26-freude@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Provide and pass the xflag parameter from pkey ioctls through
the pkey handler and further down to the implementations
(CCA, EP11, PCKMO and UV). So all the code is now prepared
and ready to support xflags ("execution flag").
The pkey layer supports the xflag PKEY_XFLAG_NOMEMALLOC: If this
flag is given in the xflags parameter, the pkey implementation is
not allowed to allocate memory but instead should fall back to using
preallocated memory or simply fail with -ENOMEM. This flag is meant for
protected key derivation within a cipher or similar which must not
allocate memory that would cause IO operations - see also the
CRYPTO_ALG_ALLOCATES_MEMORY flag in crypto.h.
Within the pkey handlers this flag is then to be translated to
appropriate zcrypt xflags before any zcrypt related functions
are called. So the PKEY_XFLAG_NOMEMALLOC translates to
ZCRYPT_XFLAG_NOMEMALLOC - If this flag is set, no memory
allocations which may trigger any IO operations are done.
The pkey in-kernel API itself does not provide this xflags
parameter yet. That's intended to come with a separate patch which
enables this functionality.
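A minimal sketch of the translation step inside a pkey handler (the helper
shown is illustrative only):

/* translate pkey execution flags to zcrypt execution flags */
static u32 pkey_to_zcrypt_xflags(u32 xflags)
{
        u32 zxflags = 0;

        if (xflags & PKEY_XFLAG_NOMEMALLOC)
                zxflags |= ZCRYPT_XFLAG_NOMEMALLOC;
        return zxflags;
}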
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Link: https://lore.kernel.org/r/20250424133619.16495-25-freude@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
The uv_get_secret_metadata() in-kernel function was only
offered and used by the pkey uv handler. Remove it as it no
longer has any users.
Suggested-by: Steffen Eiden <seiden@linux.ibm.com>
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Acked-by: Holger Dengler <dengler@linux.ibm.com>
Link: https://lore.kernel.org/r/20250424133619.16495-24-freude@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
The pkey uv functions may be called in a situation where memory
allocations which trigger IO operations are not allowed. An example:
decryption of the swap partition with protected key (PAES).
The pkey uv code takes care of this by holding one preallocated
struct uv_secret_list to be used with the new UV function
uv_find_secret(). The previously used function uv_get_secret_metadata()
always allocates and frees an ephemeral memory buffer.
The preallocated struct is protected against concurrent use by a mutex.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Link: https://lore.kernel.org/r/20250424133619.16495-23-freude@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Rename the internal UV function find_secret() to uv_find_secret()
and publish it as new UV API in-kernel function.
The pkey uv handler may be called in a do-not-allocate memory
situation where sleeping is allowed but allocating memory which
may cause IO operations is not. For example when an encrypted
swap file is used and the encryption is done via UV retrievable
secrets with protected keys.
The UV API function uv_get_secret_metadata() allocates memory
and then calls the find_secret() function. By exposing the
find_secret() function as a new UV API function uv_find_secret()
it is possible to retrieve UV secret metadata from the UV
without any memory allocations when the caller offers space
for one struct uv_secret_list.
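A hedged sketch of the intended calling convention; the exact prototype is
an assumption derived from the description above, and the caller-provided
list may just as well be a preallocated buffer instead of a stack variable:

struct uv_secret_list list;             /* space offered by the caller */
struct uv_secret_list_item_hdr secret;

/* no memory is allocated by the UV layer for this lookup */
rc = uv_find_secret(secret_id, &list, &secret);
if (rc)
        return rc;      /* secret not found or UV call failed */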
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Acked-by: Holger Dengler <dengler@linux.ibm.com>
Link: https://lore.kernel.org/r/20250424133619.16495-22-freude@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
There have been some places in the EP11 handler code where relatively
small amounts of memory have been allocated and freed at the end
of the function. This code has been reworked to use the stack instead.
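The general shape of the rework, sketched with illustrative names and sizes:

/* before: small short-lived buffer from the heap */
u8 *buf = kmalloc(64, GFP_KERNEL);
if (!buf)
        return -ENOMEM;
/* ... build the request in buf ... */
kfree(buf);

/* after: the relatively small buffer lives on the stack */
u8 buf[64];
/* ... build the request in buf ... */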
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Link: https://lore.kernel.org/r/20250424133619.16495-21-freude@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
There have been some places in the CCA handler code where relatively
small amounts of memory have been allocated and freed at the end
of the function. This code has been reworked to use the stack instead.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Link: https://lore.kernel.org/r/20250424133619.16495-20-freude@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
There are two places in the ep11 misc code where a short term
memory buffer is needed. Rework this code to use the cprb mempool
to satisfy these ephemeral memory requirements.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Link: https://lore.kernel.org/r/20250424133619.16495-19-freude@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Place the relatively small struct ep11_domain_query_info variable
on the stack instead of using kmalloc()/kfree().
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Link: https://lore.kernel.org/r/20250424133619.16495-18-freude@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Propagate the xflags argument from the cca_get_info()
caller down to the lower level functions for proper
memory allocation hints.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Link: https://lore.kernel.org/r/20250424133619.16495-17-freude@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Rework two places in the zcrypt cca misc code which use kmalloc() for
ephemeral memory allocations. As there is now a cprb mempool anyway,
use this pool instead to satisfy these short term memory
allocations.
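A minimal sketch of the pattern; the pool name is illustrative, while
mempool_alloc() and mempool_free() are the generic kernel mempool primitives:

/* short term buffer now taken from the cprb mempool */
buf = mempool_alloc(cprb_mempool, GFP_KERNEL);
if (!buf)
        return -ENOMEM;
/* ... build and send the CPRB using buf ... */
mempool_free(buf, cprb_mempool);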
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Link: https://lore.kernel.org/r/20250424133619.16495-16-freude@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Rework the memory usage of the ep11 findcard() implementation:
- findcard does not allocate memory for the list of apqns
any more.
- the callers are now responsible for providing an array of
apqns to store the matching apqns into.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Link: https://lore.kernel.org/r/20250424133619.16495-15-freude@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Rework the memory usage of the cca findcard() implementation:
- findcard does not allocate memory for the list of apqns
any more.
- the callers are now responsible for providing an array of
apqns to store the matching apqns into.
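A hedged sketch of the new calling convention (array size, parameter list
and names are illustrative and may differ from the actual interface):

u32 apqns[16];                          /* storage provided by the caller */
u32 nr_apqns = ARRAY_SIZE(apqns);

/* on return nr_apqns holds the number of matching APQNs stored */
rc = cca_findcard2(apqns, &nr_apqns, cardnr, domain,
                   minhwtype, mktype, cur_mkvp, old_mkvp, xflags);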
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Link: https://lore.kernel.org/r/20250424133619.16495-14-freude@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Remove the caching of the CCA and EP11 card and domain info.
In nearly all places where the card or domain info is fetched
the verify param was enabled and thus the cache was bypassed.
The only real place where info from the cache was used was
in the sysfs pseudo files in cases where the card/queue was
switched to "offline". All other callers insisted on getting
fresh info and thus a communication to the card was enforced.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Link: https://lore.kernel.org/r/20250424133619.16495-13-freude@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
The static function findcard() and the zcrypt cca_findcard()
function are no longer used. Remove this outdated
code together with an internal helper function only called by these.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Link: https://lore.kernel.org/r/20250424133619.16495-12-freude@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Introduce pre-allocated device status array memory, protected by a
mutex against concurrent use, to be used by the findcard()
function. Limit the device status array to max 128 cards and max
128 domains to reduce the size of this pre-allocated memory to 64 KB.
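A minimal sketch of the preallocation (names are illustrative; 128 * 128
entries of 4 bytes each give the 64 KB mentioned above):

#define MAX_DEV_STATUS_CARDS    128
#define MAX_DEV_STATUS_DOMAINS  128

/* one pre-allocated status array, guarded against concurrent use */
static struct zcrypt_device_status_ext
        dev_status_mem[MAX_DEV_STATUS_CARDS * MAX_DEV_STATUS_DOMAINS];
static DEFINE_MUTEX(dev_status_mem_mutex);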
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Link: https://lore.kernel.org/r/20250424133619.16495-11-freude@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Introduce pre-allocated device status array memory, protected by a
mutex against concurrent use, to be used by the findcard2()
function. Limit the device status array to max 128 cards and max
128 domains to reduce the size of this pre-allocated memory to 64 KB.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Link: https://lore.kernel.org/r/20250424133619.16495-10-freude@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Rework the existing function zcrypt_device_status_mask_ext():
Add two new parameters to provide upper limits for
cards and queues. The existing implementation needed an
array of 256 * 256 * 4 = 256 KB which is really huge. The
reworked function is more flexible in the sense that the
caller can decide the upper limit for cards and domains to
be stored into the status array. So for example a caller may
decide to only query for cards 0...127 and queues 0...127
and thus only an array of size 128 * 128 * 4 = 64 KB is needed.
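A hedged sketch of the reworked call; the parameter order and names are
assumed from the description:

/* query only cards 0...127 and queues 0...127: 128 * 128 * 4 = 64 KB */
zcrypt_device_status_mask_ext(device_status, 128, 128);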
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Link: https://lore.kernel.org/r/20250424133619.16495-9-freude@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Introduce a cprb mempool for the zcrypt ep11 misc functions
(zcrypt_ep11misc.*) and do some preparation rework to support
a do-not-allocate path through some zcrypt ep11 misc functions.
The mempool is controlled by the zcrypt module parameter
"mempool_threshold" which shall control the minimal amount
of memory items for CCA and EP11.
The mempool shall support "mempool_threshold" requests/replies
in parallel, which for EP11 means holding a send and a receive
buffer per request. Each of these cprb space items is
limited to 8 KB. So by default the mempool consumes
5 * 2 * 8KB = 80KB
If the mempool is depleted when one of the ep11 misc functions is
called with the ZCRYPT_XFLAG_NOMEMALLOC xflag set, the function
will fail with -ENOMEM and the caller is responsible for taking
further actions.
This is only part of a rework to support a new xflag
ZCRYPT_XFLAG_NOMEMALLOC but not yet complete.
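A minimal sketch of how such a pool can be set up with the generic mempool
API (names and the init function are illustrative;
mempool_create_kmalloc_pool() is the standard helper for kmalloc-backed pools):

#define CPRB_MEMPOOL_ITEM_SIZE  (8 * 1024)

static mempool_t *cprb_mempool;

static int __init zcrypt_ep11misc_init(void)
{
        /* send + receive buffer per request: 2 items per threshold entry */
        cprb_mempool = mempool_create_kmalloc_pool(2 * mempool_threshold,
                                                   CPRB_MEMPOOL_ITEM_SIZE);
        return cprb_mempool ? 0 : -ENOMEM;
}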
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Link: https://lore.kernel.org/r/20250424133619.16495-8-freude@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Introduce a new module parameter "zcrypt_mempool_threshold"
for the zcrypt module. This parameter controls the minimal
amount of mempool items which are pre-allocated for urgent
requests/replies and will be used with the support for the
new xflag ZCRYPT_XFLAG_NOMEMALLOC. The default value of 5
shall provide enough memory items to support up to 5 requests
(and their associated reply) in parallel. The minimum value
is 1 and is checked in zcrypt module init().
If the mempool is depleted when one of the cca misc functions is called
with the named xflag set, the function will fail with -ENOMEM
and the caller is responsible for taking further actions.
For CCA each mempool item is 16KB, as a CCA CPRB needs to
hold the request and the reply. The pool items only support
requests/replies with a limit of about 8KB.
So by default the CCA mempool consumes
5 * 16KB = 80KB
This is only part of a rework to support a new xflag
ZCRYPT_XFLAG_NOMEMALLOC but not yet complete.
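A hedged sketch of the module parameter handling described above (permission
bits and exact placement are assumptions):

static int zcrypt_mempool_threshold = 5;
module_param(zcrypt_mempool_threshold, int, 0440);
MODULE_PARM_DESC(zcrypt_mempool_threshold,
                 "CCA and EP11 mempool minimal number of items (min: 1)");

/* in zcrypt module init(): enforce the documented minimum of 1 */
if (zcrypt_mempool_threshold < 1)
        return -EINVAL;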
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Link: https://lore.kernel.org/r/20250424133619.16495-7-freude@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Introduce a new flag parameter for both cprb send functions
zcrypt_send_cprb() and zcrypt_send_ep11_cprb(). This new
xflags parameter ("execution flags") shall be used to provide
execution hints and flags for this crypto request.
There are two flags implemented to be used with these functions:
* ZCRYPT_XFLAG_USERSPACE - indicates to the lower layers that
all the pointers address userspace. So when constructing the AP msg
copy_from_user() is to be used. If this flag is NOT set, the pointers
address kernel memory and thus memcpy() is to be used.
* ZCRYPT_XFLAG_NOMEMALLOC - indicates that this task must not
allocate memory in a way which may trigger IO operations.
For the AP bus and zcrypt message layer this means:
* The ZCRYPT_XFLAG_USERSPACE is mapped to the already existing
bool variable "userspace" which is propagated to the zcrypt
proto implementations.
* The ZCRYPT_XFLAG_NOMEMALLOC results in setting the AP flag
AP_MSG_FLAG_MEMPOOL when the AP msg buffer is initialized.
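A minimal sketch of how the two flags are consumed on entry of the cprb send
functions (variable names are illustrative):

bool userspace = xflags & ZCRYPT_XFLAG_USERSPACE;
u32 ap_flags = 0;

if (xflags & ZCRYPT_XFLAG_NOMEMALLOC)
        ap_flags |= AP_MSG_FLAG_MEMPOOL;        /* AP msg buffer from mempool */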
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Link: https://lore.kernel.org/r/20250424133619.16495-6-freude@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
If a target list of APQNs is given when a CPRB is
to be sent via zcrypt_send_ep11_cprb(), there is always a
kmalloc() done and the targets are copied via z_copy_from_user().
As there are callers from kernel space (zcrypt_ep11misc.c)
which signal this via the userspace parameter, improve this
code to use the given target list directly in the kernel space
case, thus removing the unnecessary memory allocation and copy.
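The idea sketched with illustrative names; the userspace copy path is
simplified and not meant to mirror the actual helper signatures:

if (userspace) {
        /* userspace caller: allocate and copy the target list in */
        targets = kmalloc_array(target_num, sizeof(*targets), GFP_KERNEL);
        if (!targets)
                return -ENOMEM;
        if (copy_from_user(targets, utargets,
                           target_num * sizeof(*targets))) {
                kfree(targets);
                return -EFAULT;
        }
} else {
        /* kernel space caller: use the given target list directly */
        targets = (struct ep11_target_dev *)utargets;
}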
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Link: https://lore.kernel.org/r/20250424133619.16495-5-freude@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
There is a need for a do-not-allocate-memory path through the AP bus
layer. The pkey layer may be triggered via the in-kernel interface
from a protected key crypto algorithm (namely PAES) to convert a
secure key into a protected key. This happens in a workqueue context,
so sleeping is allowed but memory allocations causing IO operations
are not permitted.
To accomplish this, an AP message memory pool with pre-allocated space
is established. When ap_init_apmsg() with use_mempool set to true is
called, instead of kmalloc() the ap message buffer is allocated from
the ap_msg_pool. This pool only holds a limited number of buffers:
ap_msg_pool_min_items items of size AP_DEFAULT_MAX_MSG_SIZE, and
exactly one of these items (if available) is returned when
ap_init_apmsg() is called with the use_mempool arg set to true. When
this pool is exhausted and use_mempool is set true, ap_init_apmsg()
returns -ENOMEM without any attempt to allocate memory and the caller
has to deal with that.
Default values for this mempool of ap messages are:
* Each buffer is 12KB (that is the default AP bus size
and all the urgent messages should fit into this space).
* Minimum items held in the pool is 8. This value is adjustable
via module parameter ap.msgpool_min_items.
The zcrypt layer may use this flag to indicate to the ap bus that the
processing path for this message should not allocate memory but should
use a pre-allocated memory buffer instead. This is to prevent deadlocks
with crypto and io for example with encrypted swap volumes.
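A hedged sketch of the resulting allocation path (the exact prototype of
ap_init_apmsg() is assumed from the description above):

rc = ap_init_apmsg(&ap_msg, true);      /* use_mempool == true */
if (rc)
        return rc;      /* pool exhausted: -ENOMEM, no fallback allocation */

/* ... build and send the request ... */

ap_release_apmsg(&ap_msg);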
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Link: https://lore.kernel.org/r/20250424133619.16495-4-freude@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Slight rework of the way AP message buffers are allocated.
Instead of having multiple places with kmalloc() calls, all
the AP message buffers are now allocated and freed in exactly
one place: ap_init_apmsg() allocates the current AP bus max
limit of ap_max_msg_size (defaults to 12KB). The AP message
buffer is then freed in ap_release_apmsg().
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Link: https://lore.kernel.org/r/20250424133619.16495-3-freude@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Move the very small response_type struct into struct ap_msg.
So there is no need to kmalloc() this tiny struct with each
AP message preparation.
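The change sketched schematically (member names and layout are illustrative):

struct ap_msg {
        /* ... existing members ... */
        struct response_type response;  /* embedded, no extra kmalloc() */
};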
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Link: https://lore.kernel.org/r/20250424133619.16495-2-freude@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
In CPUMF attribute definitions for z15 all CPUMF attributes
have configuration values of the form 0x0[0-9a-f]{3}.
However two defines do not match this scheme; they have two leading
zeroes instead of one. Adjust this. No functional change.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
The third argument of strscpy() is optional and can be omitted iff
the destination is an array and the maximum size of the copy is the
size of the destination.
Remove the third argument for those cases where this is possible.
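For illustration (buffer and source names are arbitrary):

char name[16];

/* three-argument form */
strscpy(name, src, sizeof(name));

/* equivalent two-argument form, possible since name is an array */
strscpy(name, src);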
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Rename strncpy_skip_quote() to strscpy_skip_quote() and change its
implementation so that the destination string is always NUL terminated.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
There are hardly any strncpy() users left, therefore drop the
optimized s390 variant.
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
The s390 specific diag288_wdt watchdog driver makes use of the virtual
watchdog timer, which is available in most machine configurations.
If executing the diagnose instruction with subcode 0x288 results in an
exception the watchdog timer is not available, otherwise it is available.
In order to allow module autoload of the diag288_wdt module, move the
detection of the virtual watchdog timer to early boot code, and provide
its availability as a cpu feature.
This makes it possible to use module_cpu_feature_match() to automatically load
the module iff the virtual watchdog timer is available.
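A minimal sketch of the autoload hookup; the feature constant and the init
function name are assumptions, while module_cpu_feature_match() is the
generic helper from linux/cpufeature.h:

/* auto-load diag288_wdt iff the virtual watchdog CPU feature is present */
module_cpu_feature_match(S390_CPU_FEATURE_D288, diag288_init);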
Suggested-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Tested-by: Mete Durlu <meted@linux.ibm.com>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/20250410095036.1525057-1-hca@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Replace the last 2 usages of strncpy() in s390 code with strscpy().
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Add a simple sized_strscpy() implementation to allow the use of strscpy()
in the decompressor.
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Select ARCH_WANT_IRQS_OFF_ACTIVATE_MM so that activate_mm() is called with
irqs disabled. This allows calling switch_mm_irqs_off() instead of
switch_mm() and saves two local_irq_save() / local_irq_restore() pairs.
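A sketch of the resulting arrangement (simplified):

static inline void activate_mm(struct mm_struct *prev, struct mm_struct *next)
{
        /* irqs are already disabled (ARCH_WANT_IRQS_OFF_ACTIVATE_MM) */
        switch_mm_irqs_off(prev, next, current);
}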
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Reduce system call overhead time (round trip time for invoking a
non-existent system call) by 25%.
With the removal of set_fs() [1] lazy control register handling was removed
in order to keep kernel entry and exit simple. However this made system
calls slower.
With the conversion to generic entry [2] and numerous follow up changes
which simplified the entry code significantly, adding support for lazy asce
handling doesn't add much complexity to the entry code anymore.
In particular this means:
- On kernel entry the primary asce is not modified and contains the user
asce
- Kernel accesses which require secondary-space mode (for example futex
operations) are surrounded by enable_sacf_uaccess() and
disable_sacf_uaccess() calls. enable_sacf_uaccess() sets the primary asce
to kernel asce so that the sacf instruction can be used to switch to
secondary-space mode. The primary asce is changed back to user asce with
disable_sacf_uaccess().
The state of the control register which contains the primary asce is
reflected with a new TIF_ASCE_PRIMARY bit. This is required on context
switch so that the correct asce is restored for the scheduled in process.
As a result address spaces are now set up like this:
CPU running in | %cr1 ASCE | %cr7 ASCE | %cr13 ASCE
-----------------------------|-----------|-----------|-----------
user space | user | user | kernel
kernel (no sacf) | user | user | kernel
kernel (during sacf uaccess) | kernel | user | kernel
kernel (kvm guest execution) | guest | user | kernel
As a result the cr1 control register content is not changed except for:
- futex system calls
- legacy s390 PCI system calls
- the kvm specific cmpxchg_user_key() uaccess helper
This leads to faster system call execution.
[1] 87d598634521 ("s390/mm: remove set_fs / rework address space handling")
[2] 56e62a737028 ("s390: convert to generic entry")
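A minimal sketch of the sacf bracketing described above; the return and
argument types of the helpers are assumptions:

/* secondary-space access (e.g. a futex op) bracketed by sacf enable/disable */
old = enable_sacf_uaccess();    /* primary asce := kernel asce,
                                   state tracked via TIF_ASCE_PRIMARY */
/* ... perform the uaccess in secondary-space mode ... */
disable_sacf_uaccess(old);      /* primary asce := user asce again */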
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
- Properly handle errors when file-backed I/O fails
- Fix compilation issues on ARM platform (arm-linux-gnueabi)
- Fix parsing of encoded extents
- Minor cleanup
* tag 'erofs-for-6.15-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: remove duplicate code
erofs: fix encoded extents handling
erofs: add __packed annotation to union(__le16..)
erofs: set error to bio if file-backed IO fails
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"A few more miscellaneous ext4 bug fixes and cleanups including some
syzbot failures and fixing a stale file handing refeencing an inode
previously used as a regular file, but which has been deleted and
reused as an ea_inode would result in ext4 erroneously considering
this a case of fs corruption"
* tag 'ext4_for_linus-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix off-by-one error in do_split
ext4: make block validity check resistent to sb bh corruption
ext4: avoid -Wflex-array-member-not-at-end warning
Documentation: ext4: Add fields to ext4_super_block documentation
ext4: don't treat fhandle lookup of ea_inode as FS corruption
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock
Pull memblock fix from Mike Rapoport:
"Fix build of memblock test.
Add missing stubs for mutex and free_reserved_area() to memblock
tests"
* tag 'fixes-2025-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
memblock tests: Fix mutex related build error
|
|
Syzkaller detected a use-after-free issue in ext4_insert_dentry that was
caused by out-of-bounds access due to incorrect splitting in do_split.
BUG: KASAN: use-after-free in ext4_insert_dentry+0x36a/0x6d0 fs/ext4/namei.c:2109
Write of size 251 at addr ffff888074572f14 by task syz-executor335/5847
CPU: 0 UID: 0 PID: 5847 Comm: syz-executor335 Not tainted 6.12.0-rc6-syzkaller-00318-ga9cda7c0ffed #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/30/2024
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:377 [inline]
print_report+0x169/0x550 mm/kasan/report.c:488
kasan_report+0x143/0x180 mm/kasan/report.c:601
kasan_check_range+0x282/0x290 mm/kasan/generic.c:189
__asan_memcpy+0x40/0x70 mm/kasan/shadow.c:106
ext4_insert_dentry+0x36a/0x6d0 fs/ext4/namei.c:2109
add_dirent_to_buf+0x3d9/0x750 fs/ext4/namei.c:2154
make_indexed_dir+0xf98/0x1600 fs/ext4/namei.c:2351
ext4_add_entry+0x222a/0x25d0 fs/ext4/namei.c:2455
ext4_add_nondir+0x8d/0x290 fs/ext4/namei.c:2796
ext4_symlink+0x920/0xb50 fs/ext4/namei.c:3431
vfs_symlink+0x137/0x2e0 fs/namei.c:4615
do_symlinkat+0x222/0x3a0 fs/namei.c:4641
__do_sys_symlink fs/namei.c:4662 [inline]
__se_sys_symlink fs/namei.c:4660 [inline]
__x64_sys_symlink+0x7a/0x90 fs/namei.c:4660
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
</TASK>
The following loop is located right above the 'if' statement.
for (i = count-1; i >= 0; i--) {
/* is more than half of this entry in 2nd half of the block? */
if (size + map[i].size/2 > blocksize/2)
break;
size += map[i].size;
move++;
}
'i' in this case could go down to -1, in which case the sum of active entries
wouldn't exceed half the block size, but the previous behaviour would also do a
split in half if the sum would exceed it at the very last block, which in case
of having too many long name files in a single block could lead to an
out-of-bounds access and the following use-after-free.
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
Cc: stable@vger.kernel.org
Fixes: 5872331b3d91 ("ext4: fix potential negative array index in do_split()")
Signed-off-by: Artem Sadovnikov <a.sadovnikov@ispras.ru>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250404082804.2567-3-a.sadovnikov@ispras.ru
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Block validity checks need to be skipped in case they are called
for journal blocks since they are part of the system's protected
zone.
Currently, this is done by checking inode->ino against
sbi->s_es->s_journal_inum, which is a direct read from the ext4 sb
buffer head. If someone modifies this underneath us then the
s_journal_inum field might get corrupted. To prevent against this,
change the check to directly compare the inode with journal->j_inode.
**Slight change in behavior**: During the journal init path,
check_block_validity etc might be called for the journal inode when
sbi->s_journal is not set yet. In this case we now proceed with
ext4_inode_block_valid() instead of returning early. Since system zones
have not been set yet, it is okay to proceed so we can perform basic
checks on the blocks.
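A hedged sketch of the reworked check (simplified; the surrounding function
and its return convention are omitted):

journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;

/* journal blocks belong to the protected zone: skip validity checks */
if (journal && inode == journal->j_inode)
        return 0;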
Suggested-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/0c06bc9ebfcd6ccfed84a36e79147bf45ff5adc1.1743142920.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.
Use the `DEFINE_RAW_FLEX()` helper for an on-stack definition of
a flexible structure where the size of the flexible-array member
is known at compile-time, and refactor the rest of the code,
accordingly.
So, with these changes, fix the following warning:
fs/ext4/mballoc.c:3041:40: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
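For illustration, a generic use of the helper (struct and names invented for
the example, not the actual mballoc code):

struct sample {
        int n;
        u32 data[];     /* flexible array member */
};

/* on-stack instance with room for 4 'data' elements, size known at
   compile time; 'sp' behaves like a pointer to struct sample */
DEFINE_RAW_FLEX(struct sample, sp, data, 4);

sp->n = 4;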
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/Z-SF97N3AxcIMlSi@kspp
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Documentation and implementation of the ext4 super block have
slightly diverged: Padding has been removed in order to make room for
new fields that are still missing in the documentation.
Add the new fields s_encryption_level, s_first_error_errorcode,
s_last_error_errorcode to the documentation of the ext4 super block.
Fixes: f542fbe8d5e8 ("ext4 crypto: reserve codepoints used by the ext4 encryption feature")
Fixes: 878520ac45f9 ("ext4: save the error code which triggered an ext4_error() in the superblock")
Signed-off-by: Tom Vierjahn <tom.vierjahn@acm.org>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/20250324221004.5268-1-tom.vierjahn@acm.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Hide get_vm_area() from MMUless builds
The function get_vm_area() is not defined when CONFIG_MMU is not
defined. Hide that function within #ifdef CONFIG_MMU.
- Fix output of synthetic events when they have dynamic strings
The print fmt of the synthetic event's format file use to have "%.*s"
for dynamic size strings even though the user space exported
arguments had only __get_str() macro that provided just a nul
terminated string. This was fixed so that user space could parse this
properly.
But the reason that it had "%.*s" was because internally it provided
the maximum size of the string as one of the arguments. The fix that
replaced "%.*s" with "%s" caused the trace output (when the kernel
reads the event) to write "(efault)" as it would now read the length
of the string as "%s".
As the string provided is always nul terminated, there's no reason
for the internal code to use "%.*s" anyway. Just remove the length
argument to match the "%s" that is now in the format.
- Fix the ftrace subops hash logic of the manager ops hash
The function_graph uses the ftrace subops code. The subops code is a
way to have a single ftrace_ops registered with ftrace to determine
what functions will call the ftrace_ops callback. More than one user
of function graph can register a ftrace_ops with it. The function
graph infrastructure will then add this ftrace_ops as a subops with
the main ftrace_ops it registers with ftrace. This is because the
functions will always call the function graph callback which in turn
calls the subops ftrace_ops callbacks.
The main ftrace_ops must add a callback to all the functions that the
subops want a callback from. When a subops is registered, it will
update the main ftrace_ops hash to include the functions it wants.
This is the logic that was broken.
The ftrace_ops hash has a "filter_hash" and a "notrace_hash" where
all the functions in the filter_hash but not in the notrace_hash are
attached by ftrace. The original logic would have the main ftrace_ops
filter_hash be a union of all the subops filter_hashes and the main
notrace_hash would be an intersection of all the subops filter hashes.
But this was incorrect because the notrace hash depends on the
filter_hash it is associated to and not the union of all
filter_hashes.
Instead, when a subops is added, just include all the functions of
the subops hash that are in its filter_hash but not in its
notrace_hash. The main subops hash should not use its notrace hash,
unless all of its subops hashes have an empty filter_hash (which
means to attach to all functions), and then, and only then, the main
ftrace_ops notrace hash can be the intersect of all the subops
hashes.
This not only fixes the bug, but also simplifies the code.
- Add a selftest to better test the subops filtering
Add a selftest that would catch the bug fixed by the above change.
- Fix extra newline printed in function tracing with retval
The function parameter code changed the output logic slightly and
called print_graph_retval() and also printed a newline. The
print_graph_retval() also prints a newline which caused blank lines
to be printed in the function graph tracer when retval was added.
This caused one of the selftests to fail if retvals were enabled.
Instead remove the new line output from print_graph_retval() and have
the callers always print the new line so that it doesn't have to do
special logic if it calls print_graph_retval() or not.
- Fix out-of-bound memory access in the runtime verifier
When rv_is_container_monitor() is called on the last entry on the
linked list it references the next entry, which is the list head, and
causes an out-of-bound memory access.
* tag 'trace-v6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
rv: Fix out-of-bound memory access in rv_is_container_monitor()
ftrace: Do not have print_graph_retval() add a newline
tracing/selftest: Add test to better test subops filtering of function graph
ftrace: Fix accounting of subop hashes
ftrace: Properly merge notrace hashes
tracing: Do not add length to print format in synthetic events
tracing: Hide get_vm_area() from MMUless builds
|
|
Pull bpf fixes from Alexei Starovoitov:
- Followup fixes for resilient spinlock (Kumar Kartikeya Dwivedi):
- Make res_spin_lock test less verbose, since it was spamming BPF
CI on failure, and make the check for AA deadlock stronger
- Fix rebasing mistake and use architecture provided
res_smp_cond_load_acquire
- Convert BPF maps (queue_stack and ringbuf) to resilient spinlock
to address long standing syzbot reports
- Make sure that classic BPF load instruction from SKF_[NET|LL]_OFF
offsets works when the skb is fragmented (Willem de Bruijn)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Convert ringbuf map to rqspinlock
bpf: Convert queue_stack map to rqspinlock
bpf: Use architecture provided res_smp_cond_load_acquire
selftests/bpf: Make res_spin_lock AA test condition stronger
selftests/net: test sk_filter support for SKF_NET_OFF on frags
bpf: support SKF_NET_OFF and SKF_LL_OFF on skb frags
selftests/bpf: Make res_spin_lock test less verbose
|
|
When rv_is_container_monitor() is called on the last monitor in
rv_monitors_list, KASAN yells:
BUG: KASAN: global-out-of-bounds in rv_is_container_monitor+0x101/0x110
Read of size 8 at addr ffffffff97c7c798 by task setup/221
The buggy address belongs to the variable:
rv_monitors_list+0x18/0x40
This is because list_next_entry() is called on the last entry in the list.
It wraps around to the first list_head, and the first list_head is not
embedded in struct rv_monitor_def.
Fix it by checking if the monitor is last in the list.
Cc: stable@vger.kernel.org
Cc: Gabriele Monaco <gmonaco@redhat.com>
Fixes: cb85c660fcd4 ("rv: Add option for nested monitors and include sched")
Link: https://lore.kernel.org/e85b5eeb7228bfc23b8d7d4ab5411472c54ae91b.1744355018.git.namcao@linutronix.de
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|