Age | Commit message (Collapse) | Author |
|
Assign netdev.dev_port based on the device channel index, to indicate the
port number of the network device.
While this driver already uses netdev.dev_id for that purpose, dev_port is
more appropriate. However, retain dev_id to avoid potential regressions.
Fixes: 3e66d0138c05 ("can: populate netdev::dev_id for udev discrimination")
Reviewed-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Jimmy Assarsson <extja@kvaser.com>
Link: https://patch.msgid.link/20250725123452.41-4-extja@kvaser.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
Add support for ethtool set_phys_id(), to physically locate devices by
flashing a LED on the device.
Reviewed-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Jimmy Assarsson <extja@kvaser.com>
Link: https://patch.msgid.link/20250725123452.41-3-extja@kvaser.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
Add support to turn on/off CAN LEDs on device.
Reviewed-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Jimmy Assarsson <extja@kvaser.com>
Link: https://patch.msgid.link/20250725123452.41-2-extja@kvaser.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
CAN interfaces"
Jimmy Assarsson <extja@kvaser.com> says:
This patch series simplifies the process of identifying which network
interface (can0..canX) corresponds to which physical CAN channel on
Kvaser PCIe based CAN interfaces.
Changes in v4:
- Fix transient Sparse warning
- Add tag Reviewed-by Vincent Mailhol
Changes in v3:
- Fixed typo; kvaser_pcied -> kvaser_pciefd in documentation patch
Changes in v2:
- Replace use of netdev.dev_id with netdev.dev_port
- Formatting and refactoring
- New patch with devlink documentation
Link: https://patch.msgid.link/20250725123230.8-1-extja@kvaser.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
List the version information reported by the kvaser_pciefd driver
through devlink.
Suggested-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Reviewed-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Jimmy Assarsson <extja@kvaser.com>
Link: https://patch.msgid.link/20250725123230.8-11-extja@kvaser.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
Register each CAN channel of the device as an devlink physical port.
This makes it easier to get device information for a given network
interface (i.e. can2).
Example output:
$ devlink dev
pci/0000:07:00.0
pci/0000:08:00.0
pci/0000:09:00.0
$ devlink port
pci/0000:07:00.0/0: type eth netdev can0 flavour physical port 0 splittable false
pci/0000:07:00.0/1: type eth netdev can1 flavour physical port 1 splittable false
pci/0000:07:00.0/2: type eth netdev can2 flavour physical port 2 splittable false
pci/0000:07:00.0/3: type eth netdev can3 flavour physical port 3 splittable false
pci/0000:08:00.0/0: type eth netdev can4 flavour physical port 0 splittable false
pci/0000:08:00.0/1: type eth netdev can5 flavour physical port 1 splittable false
pci/0000:09:00.0/0: type eth netdev can6 flavour physical port 0 splittable false
pci/0000:09:00.0/1: type eth netdev can7 flavour physical port 1 splittable false
pci/0000:09:00.0/2: type eth netdev can8 flavour physical port 2 splittable false
pci/0000:09:00.0/3: type eth netdev can9 flavour physical port 3 splittable false
$ devlink port show can2
pci/0000:07:00.0/2: type eth netdev can2 flavour physical port 2 splittable false
$ devlink dev info
pci/0000:07:00.0:
driver kvaser_pciefd
versions:
running:
fw 1.3.75
pci/0000:08:00.0:
driver kvaser_pciefd
versions:
running:
fw 2.4.29
pci/0000:09:00.0:
driver kvaser_pciefd
versions:
running:
fw 1.3.72
$ sudo ethtool -i can2
driver: kvaser_pciefd
version: 6.8.0-40-generic
firmware-version: 1.3.75
expansion-rom-version:
bus-info: 0000:07:00.0
supports-statistics: no
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
Reviewed-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Jimmy Assarsson <extja@kvaser.com>
Link: https://patch.msgid.link/20250725123230.8-10-extja@kvaser.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
Expose device firmware version via devlink info_get().
Example output:
$ devlink dev
pci/0000:07:00.0
pci/0000:08:00.0
pci/0000:09:00.0
$ devlink dev info
pci/0000:07:00.0:
driver kvaser_pciefd
versions:
running:
fw 1.3.75
pci/0000:08:00.0:
driver kvaser_pciefd
versions:
running:
fw 2.4.29
pci/0000:09:00.0:
driver kvaser_pciefd
versions:
running:
fw 1.3.72
Reviewed-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Jimmy Assarsson <extja@kvaser.com>
Link: https://patch.msgid.link/20250725123230.8-9-extja@kvaser.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
Add devlink support at device level.
Example output:
$ devlink dev
pci/0000:07:00.0
pci/0000:08:00.0
pci/0000:09:00.0
$ devlink dev info
pci/0000:07:00.0:
driver kvaser_pciefd
pci/0000:08:00.0:
driver kvaser_pciefd
pci/0000:09:00.0:
driver kvaser_pciefd
Reviewed-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Jimmy Assarsson <extja@kvaser.com>
Link: https://patch.msgid.link/20250725123230.8-8-extja@kvaser.com
[mkl: kvaser_pciefd_remove(): fix use-after-free]
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
Split driver into C-file and header-file, to simplify future patches.
Move common definitions and declarations to a header file.
Reviewed-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Jimmy Assarsson <extja@kvaser.com>
Link: https://patch.msgid.link/20250725123230.8-7-extja@kvaser.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
Store device channel index in netdev.dev_port.
Fixes: 26ad340e582d ("can: kvaser_pciefd: Add driver for Kvaser PCIEcan devices")
Reviewed-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Jimmy Assarsson <extja@kvaser.com>
Link: https://patch.msgid.link/20250725123230.8-6-extja@kvaser.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
Store firmware version in kvaser_pciefd_fw_version struct, specifying the
different components of the version number.
And drop debug prinout of firmware version, since later patches will expose
it via the devlink interface.
Reviewed-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Jimmy Assarsson <extja@kvaser.com>
Link: https://patch.msgid.link/20250725123230.8-5-extja@kvaser.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
Add intermediate variable, for readability and to simplify future patches.
Reviewed-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Jimmy Assarsson <extja@kvaser.com>
Link: https://patch.msgid.link/20250725123230.8-4-extja@kvaser.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
Add support for ethtool set_phys_id(), to physically locate devices by
flashing a LED on the device.
Reviewed-by: Axel Forsman <axfo@kvaser.com>
Reviewed-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Jimmy Assarsson <extja@kvaser.com>
Link: https://patch.msgid.link/20250725123230.8-3-extja@kvaser.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
Add support to turn on/off CAN LEDs on device.
Turn off all CAN LEDs in probe, since they are default on after a reset or
power on.
Reviewed-by: Axel Forsman <axfo@kvaser.com>
Reviewed-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Jimmy Assarsson <extja@kvaser.com>
Link: https://patch.msgid.link/20250725123230.8-2-extja@kvaser.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
Pull block fix from Jens Axboe:
"Just a single fix for regression in this release, where a module
reference could be leaked"
* tag 'block-6.16-20250725' of git://git.kernel.dk/linux:
block: fix module reference leak in mq-deadline I/O scheduler
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"Two last-minute fixes for this cycle:
- Set afs vllist to NULL if addr parsing fails
- Add a missing check for reaching the end of the string in afs"
* tag 'vfs-6.16-rc8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
afs: Set vllist to NULL if addr parsing fails
afs: Fix check for NULL terminator
|
|
Pull bcachefs fixes from Kent Overstreet:
"User reported fixes:
- Fix btree node scan on encrypted filesystems by not using btree
node header fields encrypted
- Fix a race in btree write buffer flush; this caused EROs primarily
during fsck for some people"
* tag 'bcachefs-2025-07-24' of git://evilpiepirate.org/bcachefs:
bcachefs: Add missing snapshots_seen_add_inorder()
bcachefs: Fix write buffer flushing from open journal entry
bcachefs: btree_node_scan: don't re-read before initializing found_btree_node
|
|
Commit e7607f7d6d81 ("ARM: 9443/1: Require linker to support KEEP within
OVERLAY for DCE") accidentally broke the binutils version restriction
that was added in commit 0d437918fb64 ("ARM: 9414/1: Fix build issue
with LD_DEAD_CODE_DATA_ELIMINATION"), reintroducing the segmentation
fault addressed by that workaround.
Restore the binutils version dependency by using
CONFIG_LD_CAN_USE_KEEP_IN_OVERLAY as an additional condition to ensure
that CONFIG_HAVE_LD_DEAD_CODE_DATA_ELIMINATION is only enabled with
binutils >= 2.36 and ld.lld >= 21.0.0.
Closes: https://lore.kernel.org/6739da7d-e555-407a-b5cb-e5681da71056@landley.net/
Closes: https://lore.kernel.org/CAFERDQ0zPoya5ZQfpbeuKVZEo_fKsonLf6tJbp32QnSGAtbi+Q@mail.gmail.com/
Cc: stable@vger.kernel.org
Fixes: e7607f7d6d81 ("ARM: 9443/1: Require linker to support KEEP within OVERLAY for DCE")
Reported-by: Rob Landley <rob@landley.net>
Tested-by: Rob Landley <rob@landley.net>
Reported-by: Martin Wetterwald <martin@wetterwald.eu>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
|
|
After commit d5c8d6e0fa61 ("kbuild: Update assembler calls to use proper
flags and language target"), which updated as-instr to use the
'assembler-with-cpp' language option, the Kbuild version of as-instr
always fails internally for arch/arm with
<command-line>: fatal error: asm/unified.h: No such file or directory
compilation terminated.
because '-include' flags are now taken into account by the compiler
driver and as-instr does not have '$(LINUXINCLUDE)', so unified.h is not
found.
This went unnoticed at the time of the Kbuild change because the last
use of as-instr in Kbuild that arch/arm could reach was removed in 5.7
by commit 541ad0150ca4 ("arm: Remove 32bit KVM host support") but a
stable backport of the Kbuild change to before that point exposed this
potential issue if one were to be reintroduced.
Follow the general pattern of '-include' paths throughout the tree and
make unified.h absolute using '$(srctree)' to ensure KBUILD_AFLAGS can
be used independently.
Closes: https://lore.kernel.org/CACo-S-1qbCX4WAVFA63dWfHtrRHZBTyyr2js8Lx=Az03XHTTHg@mail.gmail.com/
Cc: stable@vger.kernel.org
Fixes: d5c8d6e0fa61 ("kbuild: Update assembler calls to use proper flags and language target")
Reported-by: KernelCI bot <bot@kernelci.org>
Reviewed-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
|
|
A syzbot fuzzed image triggered a BUG_ON in ext4_update_inline_data()
when an inode had the INLINE_DATA_FL flag set but was missing the
system.data extended attribute.
Since this can happen due to a maiciouly fuzzed file system, we
shouldn't BUG, but rather, report it as a corrupted file system.
Add similar replacements of BUG_ON with EXT4_ERROR_INODE() ii
ext4_create_inline_data() and ext4_inline_data_truncate().
Reported-by: syzbot+544248a761451c0df72f@syzkaller.appspotmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Although we now perform ordered traversal within an xarray, this is
currently limited to a single xarray. However, we have multiple such
xarrays, which prevents us from guaranteeing a linear-like traversal
where all groups on the right are visited before all groups on the left.
For example, suppose we have 128 block groups, with a target group of 64,
a target length corresponding to an order of 1, and available free groups
of 16 (order 1) and group 65 (order 8):
For linear traversal, when no suitable free block is found in group 64, it
will search in the next block group until group 127, then start searching
from 0 up to block group 63. It ensures continuous forward traversal, which
is consistent with the unidirectional rotation behavior of HDD platters.
Additionally, the block group lock contention during freeing block is
unavoidable. The goal increasing from 0 to 64 indicates that previously
scanned groups (which had no suitable free space and are likely to free
blocks later) and skipped groups (which are currently in use) have newly
freed some used blocks. If we allocate blocks in these groups, the
probability of competing with other processes increases.
For non-linear traversal, we first traverse all groups in order_1. If only
group 16 has free space in this list, we first traverse [63, 128), then
traverse [0, 64) to find the available group 16, and then allocate blocks
in group 16. Therefore, it cannot guarantee continuous traversal in one
direction, thus increasing the probability of contention.
So refactor ext4_mb_scan_groups_xarray() to ext4_mb_scan_groups_xa_range()
to only traverse a fixed range of groups, and move the logic for handling
wrap around to the caller. The caller first iterates through all xarrays
in the range [start, ngroups) and then through the range [0, start). This
approach simulates a linear scan, which reduces contention between freeing
blocks and allocating blocks.
Assume we have the following groups, where "|" denotes the xarray traversal
start position:
order_1_groups: AB | CD
order_2_groups: EF | GH
Traversal order:
Before: C > D > A > B > G > H > E > F
After: C > D > G > H > A > B > E > F
Performance test data follows:
|CPU: Kunpeng 920 | P80 | P1 |
|Memory: 512GB |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 19555 | 20049 (+2.5%) | 315636 | 316724 (-0.3%) |
|mb_optimize_scan=1 | 15496 | 19342 (+24.8%) | 323569 | 328324 (+1.4%) |
|CPU: AMD 9654 * 2 | P96 | P1 |
|Memory: 1536GB |------------------------|-------------------------|
|960GB SSD (1GB/s) | base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 53192 | 52125 (-2.0%) | 212678 | 215136 (+1.1%) |
|mb_optimize_scan=1 | 37636 | 50331 (+33.7%) | 214189 | 209431 (-2.2%) |
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-18-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
This commit converts the `choose group` logic to `scan group` using
previously prepared helper functions. This allows us to leverage xarrays
for ordered non-linear traversal, thereby mitigating the "bouncing" issue
inherent in the `choose group` mechanism.
This also decouples linear and non-linear traversals, leading to cleaner
and more readable code.
Key changes:
* ext4_mb_choose_next_group() is refactored to ext4_mb_scan_groups().
* Replaced ext4_mb_good_group() with ext4_mb_scan_group() in non-linear
traversals, and related functions now return error codes instead of
group info.
* Added ext4_mb_scan_groups_linear() for performing linear scans starting
from a specific group for a set number of times.
* Linear scans now execute up to sbi->s_mb_max_linear_groups times,
so ac_groups_linear_remaining is removed as it's no longer used.
* ac->ac_criteria is now used directly instead of passing cr around.
Also, ac->ac_criteria is incremented directly after groups scan fails
for the corresponding criteria.
* Since we're now directly scanning groups instead of finding a good group
then scanning, the following variables and flags are no longer needed,
s_bal_cX_groups_considered is sufficient.
s_bal_p2_aligned_bad_suggestions
s_bal_goal_fast_bad_suggestions
s_bal_best_avail_bad_suggestions
EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED
EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED
EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-17-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
While traversing the list, holding a spin_lock prevents load_buddy, making
direct use of ext4_try_lock_group impossible. This can lead to a bouncing
scenario where spin_is_locked(grp_A) succeeds, but ext4_try_lock_group()
fails, forcing the list traversal to repeatedly restart from grp_A.
In contrast, linear traversal directly uses ext4_try_lock_group(),
avoiding this bouncing. Therefore, we need a lockless, ordered traversal
to achieve linear-like efficiency.
Therefore, this commit converts both average fragment size lists and
largest free order lists into ordered xarrays.
In an xarray, the index represents the block group number and the value
holds the block group information; a non-empty value indicates the block
group's presence.
While insertion and deletion complexity remain O(1), lookup complexity
changes from O(1) to O(nlogn), which may slightly reduce single-threaded
performance.
Additionally, xarray insertions might fail, potentially due to memory
allocation issues. However, since we have linear traversal as a fallback,
this isn't a major problem. Therefore, we've only added a warning message
for insertion failures here.
A helper function ext4_mb_find_good_group_xarray() is added to find good
groups in the specified xarray starting at the specified position start,
and when it reaches ngroups-1, it wraps around to 0 and then to start-1.
This ensures an ordered traversal within the xarray.
Performance test results are as follows: Single-process operations
on an empty disk show negligible impact, while multi-process workloads
demonstrate a noticeable performance gain.
|CPU: Kunpeng 920 | P80 | P1 |
|Memory: 512GB |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 20097 | 19555 (-2.6%) | 316141 | 315636 (-0.2%) |
|mb_optimize_scan=1 | 13318 | 15496 (+16.3%) | 325273 | 323569 (-0.5%) |
|CPU: AMD 9654 * 2 | P96 | P1 |
|Memory: 1536GB |------------------------|-------------------------|
|960GB SSD (1GB/s) | base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 53603 | 53192 (-0.7%) | 214243 | 212678 (-0.7%) |
|mb_optimize_scan=1 | 20887 | 37636 (+80.1%) | 213632 | 214189 (+0.2%) |
[ Applied spelling fixes per discussion on the ext4-list see thread
referened in the Link tag. --tytso]
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-16-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Extract ext4_mb_scan_group() to make the code clearer and to
prepare for the later conversion of 'choose group' to 'scan groups'.
No functional changes.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-15-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Extract ext4_mb_might_prefetch() to make the code clearer and to
prepare for the later conversion of 'choose group' to 'scan groups'.
No functional changes.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-14-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Extract __ext4_mb_scan_group() to make the code clearer and to
prepare for the later conversion of 'choose group' to 'scan groups'.
No functional changes.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-13-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The grp->bb_largest_free_order is updated regardless of whether
mb_optimize_scan is enabled. This can lead to inconsistencies between
grp->bb_largest_free_order and the actual s_mb_largest_free_orders list
index when mb_optimize_scan is repeatedly enabled and disabled via remount.
For example, if mb_optimize_scan is initially enabled, largest free
order is 3, and the group is in s_mb_largest_free_orders[3]. Then,
mb_optimize_scan is disabled via remount, block allocations occur,
updating largest free order to 2. Finally, mb_optimize_scan is re-enabled
via remount, more block allocations update largest free order to 1.
At this point, the group would be removed from s_mb_largest_free_orders[3]
under the protection of s_mb_largest_free_orders_locks[2]. This lock
mismatch can lead to list corruption.
To fix this, whenever grp->bb_largest_free_order changes, we now always
attempt to remove the group from its old order list. However, we only
insert the group into the new order list if `mb_optimize_scan` is enabled.
This approach helps prevent lock inconsistencies and ensures the data in
the order lists remains reliable.
Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@vger.kernel.org
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-12-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Groups with no free blocks shouldn't be in any average fragment size list.
However, when all blocks in a group are allocated(i.e., bb_fragments or
bb_free is 0), we currently skip updating the average fragment size, which
means the group isn't removed from its previous s_mb_avg_fragment_size[old]
list.
This created "zombie" groups that were always skipped during traversal as
they couldn't satisfy any block allocation requests, negatively impacting
traversal efficiency.
Therefore, when a group becomes completely full, bb_avg_fragment_size_order
is now set to -1. If the old order was not -1, a removal operation is
performed; if the new order is not -1, an insertion is performed.
Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@vger.kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-11-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Attempt to merge ext4_free_data with already inserted free extents prior
to adding new ones. This strategy drastically cuts down the number of
times locks are held.
For example, if prev, new, and next extents are all mergeable, the existing
code (before this patch) requires acquiring the s_md_lock three times:
prev merge into new and free prev // hold lock
next merge into new and free next // hold lock
insert new // hold lock
After the patch, it only needs to be acquired once:
new merge into next and free new // no lock
next merge into prev and free next // hold lock
Performance test data follows:
Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.
|CPU: Kunpeng 920 | P80 | P1 |
|Memory: 512GB |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 20043 | 20097 (+0.2%) | 314331 | 316141 (+0.5%) |
|mb_optimize_scan=1 | 7290 | 13318 (+87.4%) | 324226 | 325273 (+0.3%) |
|CPU: AMD 9654 * 2 | P96 | P1 |
|Memory: 1536GB |------------------------|-------------------------|
|960GB SSD (1GB/s) | base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 54999 | 53603 (-2.5%) | 214380 | 214243 (-0.06%)|
|mb_optimize_scan=1 | 13497 | 20887 (+54.6%) | 216276 | 213632 (-1.2%) |
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-10-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Previously, s_md_lock was used to protect s_mb_free_pending during
modifications, while smp_mb() ensured fresh reads, so s_md_lock just
guarantees the atomicity of s_mb_free_pending. Thus we optimized it by
converting s_mb_free_pending into an atomic variable, thereby eliminating
s_md_lock and minimizing lock contention. This also prepares for future
lockless merging of free extents.
Following this modification, s_md_lock is exclusively responsible for
managing insertions and deletions within s_freed_data_list, along with
operations involving list_splice.
Performance test data follows:
Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.
|CPU: Kunpeng 920 | P80 | P1 |
|Memory: 512GB |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 19628 | 20043 (+2.1%) | 320885 | 314331 (-2.0%) |
|mb_optimize_scan=1 | 7129 | 7290 (+2.2%) | 321275 | 324226 (+0.9%) |
|CPU: AMD 9654 * 2 | P96 | P1 |
|Memory: 1536GB |------------------------|-------------------------|
|960GB SSD (1GB/s) | base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 53760 | 54999 (+2.3%) | 213145 | 214380 (+0.5%) |
|mb_optimize_scan=1 | 12716 | 13497 (+6.1%) | 215262 | 216276 (+0.4%) |
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-9-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Remove the superfluous "find_".
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-8-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Since nobody has used these EXT4_MB_HINT flags for ages,
let's remove them.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-7-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When allocating data blocks, if the first try (goal allocation) fails and
stream allocation is on, it tries a global goal starting from the last
group we used (s_mb_last_group). This helps cluster large files together
to reduce free space fragmentation, and the data block contiguity also
accelerates write-back to disk.
However, when multiple processes allocate blocks, having just one global
goal means they all fight over the same group. This drastically lowers
the chances of extents merging and leads to much worse file fragmentation.
To mitigate this multi-process contention, we now employ multiple global
goals, with the number of goals being the minimum between the number of
possible CPUs and one-quarter of the filesystem's total block group count.
To ensure a consistent goal for each inode, we select the corresponding
goal by taking the inode number modulo the total number of goals.
Performance test data follows:
Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.
|CPU: Kunpeng 920 | P80 | P1 |
|Memory: 512GB |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 9636 | 19628 (+103%) | 337597 | 320885 (-4.9%) |
|mb_optimize_scan=1 | 4834 | 7129 (+47.4%) | 341440 | 321275 (-5.9%) |
|CPU: AMD 9654 * 2 | P96 | P1 |
|Memory: 1536GB |------------------------|-------------------------|
|960GB SSD (1GB/s) | base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 22341 | 53760 (+140%) | 219707 | 213145 (-2.9%) |
|mb_optimize_scan=1 | 9177 | 12716 (+38.5%) | 215732 | 215262 (+0.2%) |
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-6-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
After we optimized the block group lock, we found another lock
contention issue when running will-it-scale/fallocate2 with multiple
processes. The fallocate's block allocation and the truncate's block
release were fighting over the s_md_lock. The problem is, this lock
protects totally different things in those two processes: the list of
freed data blocks (s_freed_data_list) when releasing, and where to start
looking for new blocks (mb_last_group) when allocating.
Now we only need to track s_mb_last_group and no longer need to track
s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
two are consistent. Since s_mb_last_group is merely a hint and doesn't
require strong synchronization, READ_ONCE/WRITE_ONCE is sufficient.
Besides, the s_mb_last_group data type only requires ext4_group_t
(i.e., unsigned int), rendering unsigned long superfluous.
Performance test data follows:
Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.
|CPU: Kunpeng 920 | P80 | P1 |
|Memory: 512GB |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 4821 | 9636 (+99.8%) | 314065 | 337597 (+7.4%) |
|mb_optimize_scan=1 | 4784 | 4834 (+1.04%) | 316344 | 341440 (+7.9%) |
|CPU: AMD 9654 * 2 | P96 | P1 |
|Memory: 1536GB |------------------------|-------------------------|
|960GB SSD (1GB/s) | base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 15371 | 22341 (+45.3%) | 205851 | 219707 (+6.7%) |
|mb_optimize_scan=1 | 6101 | 9177 (+50.4%) | 207373 | 215732 (+4.0%) |
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-5-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Since stream allocation does not use ac->ac_f_ex.fe_start, it is set to -1
by default, so the no longer needed sbi->s_mb_last_start is removed.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-4-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
In ext4_mb_regular_allocator(), after the call to ext4_mb_find_by_goal()
fails to achieve the inode goal, allocation continues with the stream
allocation global goal. Currently, hits for both are combined in
sbi->s_bal_goals, hindering accurate optimization.
This commit separates global goal hits into sbi->s_bal_stream_goals. Since
stream allocation doesn't use ac->ac_g_ex.fe_start, set fe_start to -1.
This prevents stream allocations from being counted in s_bal_goals. Also
clear EXT4_MB_HINT_TRY_GOAL to avoid calling ext4_mb_find_by_goal again.
After adding `stream_goal_hits`, `/proc/fs/ext4/sdx/mb_stats` will show:
mballoc:
reqs: 840347
success: 750992
groups_scanned: 1230506
cr_p2_aligned_stats:
hits: 21531
groups_considered: 411664
extents_scanned: 21531
useless_loops: 0
bad_suggestions: 6
cr_goal_fast_stats:
hits: 111222
groups_considered: 1806728
extents_scanned: 467908
useless_loops: 0
bad_suggestions: 13
cr_best_avail_stats:
hits: 36267
groups_considered: 1817631
extents_scanned: 156143
useless_loops: 0
bad_suggestions: 204
cr_goal_slow_stats:
hits: 106396
groups_considered: 5671710
extents_scanned: 22540056
useless_loops: 123747
cr_any_free_stats:
hits: 138071
groups_considered: 724692
extents_scanned: 23615593
useless_loops: 585
extents_scanned: 46804261
goal_hits: 1307
stream_goal_hits: 236317
len_goal_hits: 155549
2^n_hits: 21531
breaks: 225096
lost: 35062
buddies_generated: 40/40
buddies_time_used: 48004
preallocated: 5962467
discarded: 4847560
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-3-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When ext4 allocates blocks, we used to just go through the block groups
one by one to find a good one. But when there are tons of block groups
(like hundreds of thousands or even millions) and not many have free space
(meaning they're mostly full), it takes a really long time to check them
all, and performance gets bad. So, we added the "mb_optimize_scan" mount
option (which is on by default now). It keeps track of some group lists,
so when we need a free block, we can just grab a likely group from the
right list. This saves time and makes block allocation much faster.
But when multiple processes or containers are doing similar things, like
constantly allocating 8k blocks, they all try to use the same block group
in the same list. Even just two processes doing this can cut the IOPS in
half. For example, one container might do 300,000 IOPS, but if you run two
at the same time, the total is only 150,000.
Since we can already look at block groups in a non-linear way, the first
and last groups in the same list are basically the same for finding a block
right now. Therefore, add an ext4_try_lock_group() helper function to skip
the current group when it is locked by another process, thereby avoiding
contention with other processes. This helps ext4 make better use of having
multiple block groups.
Also, to make sure we don't skip all the groups that have free space
when allocating blocks, we won't try to skip busy groups anymore when
ac_criteria is CR_ANY_FREE.
Performance test data follows:
Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.
|CPU: Kunpeng 920 | P80 |
|Memory: 512GB |-------------------------|
|960GB SSD (0.5GB/s)| base | patched |
|-------------------|-------|-----------------|
|mb_optimize_scan=0 | 2667 | 4821 (+80.7%) |
|mb_optimize_scan=1 | 2643 | 4784 (+81.0%) |
|CPU: AMD 9654 * 2 | P96 |
|Memory: 1536GB |-------------------------|
|960GB SSD (1GB/s) | base | patched |
|-------------------|-------|-----------------|
|mb_optimize_scan=0 | 3450 | 15371 (+345%) |
|mb_optimize_scan=1 | 3209 | 6101 (+90.0%) |
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-2-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Various changes in the "ext4: better scalability for ext4 block
allocation" patch series have resulted in kunit test failures, most
notably in the test_new_blocks_simple and the test_mb_mark_used tests.
The root cause of these failures is that various in-memory ext4 data
structures were not getting initialized, and while previous versions
of the functions exercised by the unit tests didn't use these
structure members, this was arguably a test bug.
Since one of the patches in the block allocation scalability patches
is a fix which is has a cc:stable tag, this commit also has a
cc:stable tag.
CC: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250714130327.1830534-1-libaokun1@huawei.com
Link: https://patch.msgid.link/20250725021550.3177573-1-yi.zhang@huaweicloud.com
Link: https://patch.msgid.link/20250725021654.3188798-1-yi.zhang@huaweicloud.com
Reported-by: Guenter Roeck <linux@roeck-us.net>
Closes: https://lore.kernel.org/linux-ext4/b0635ad0-7ebf-4152-a69b-58e7e87d5085@roeck-us.net/
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The field 'reacting' in struct rv_monitor is set but never used. Delete it.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/a6c16f845d2f1a09c4d0934ab83f3cb14478a71d.1753378331.git.namcao@linutronix.de
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
rv_reactor has a reference counter to ensure it is not removed while
monitors are still using it.
However, this is futile, as __exit functions are not expected to fail and
will proceed normally despite rv_unregister_reactor() returning an error.
At the moment, reactors do not support being built as modules, therefore
they are never removed and the reference counters are not necessary.
If we support building RV reactors as modules in the future, kernel
module's centralized facilities such as try_module_get(), module_put() or
MODULE_SOFTDEP should be used instead of this custom implementation.
Remove this reference counter.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/bb946398436a5e17fb0f5b842ef3313c02291852.1753378331.git.namcao@linutronix.de
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Each struct rv_reactor has a unique struct rv_reactor_def associated with
it. struct rv_reactor is statically allocated, while struct rv_reactor_def
is dynamically allocated.
This makes the code more complicated than it should be:
- Lookup is required to get the associated rv_reactor_def from rv_reactor
- Dynamic memory allocation is required for rv_reactor_def. This is
harder to get right compared to static memory. For instance, there is
an existing mistake: rv_unregister_reactor() does not free the memory
allocated by rv_register_reactor(). This is fortunately not a real
memory leak problem as rv_unregister_reactor() is never called.
Simplify and merge rv_reactor_def into rv_reactor.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/71cb91c86cd40df5b8c492b788787f2a73c3eaa3.1753378331.git.namcao@linutronix.de
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Each struct rv_monitor has a unique struct rv_monitor_def associated with
it. struct rv_monitor is statically allocated, while struct rv_monitor_def
is dynamically allocated.
This makes the code more complicated than it should be:
- Lookup is required to get the associated rv_monitor_def from rv_monitor
- Dynamic memory allocation is required for rv_monitor_def. This is
harder to get right compared to static memory. For instance, there is
an existing mistake: rv_unregister_monitor() does not free the memory
allocated by rv_register_monitor(). This is fortunately not a real
memory leak problem, as rv_unregister_monitor() is never called.
Simplify and merge rv_monitor_def into rv_monitor.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/194449c00f87945c207aab4c96920c75796a4f53.1753378331.git.namcao@linutronix.de
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
rv_monitor_def::task_monitor is not used. Delete it.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/502d94f2696435690a2b1fdbe80a9e56c96fcabf.1753378331.git.namcao@linutronix.de
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The events hugepage_set_pmd, hugepage_set_pud, hugepage_update_pmd and
hugepage_update_pud are only called when CONFIG_PPC_BOOK3S_64 is defined.
As each event can take up to 5K regardless if they are used or not, it's
best not to define them when they are not used. Add #ifdef around these
events when they are not used.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/20250612101259.0ad43e48@batman.local.home
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The MTD device is blocked from writing to the SPI-NOR chip if any region
of it is write-protected, even if "writeable=1" module parameter is set.
Add ability to bypass this behaviour by introducing new module parameter
"ignore_protestion_status" which allows to rely on the write protection
mechanism of SPI-NOR chip itself, which most modern chips (since
the 1990'+) have already implemented.
Any erase/write operations performed on the write-protected section will
be rejected by the chip.
Signed-off-by: Jakub Czapiga <czapiga@google.com>
Link: https://patch.msgid.link/20250725122542.2633334-1-czapiga@google.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
Document that using the MMIO helpers will automatically enable the
'fast_io' parameter. This makes the used locking scheme more transparent
and avoids superfluous setting of this parameter in drivers.
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Link: https://patch.msgid.link/20250725110337.4303-2-wsa+renesas@sang-engineering.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
The kmemleak reports memory leaks related to elevator resources that
were originally allocated in the ->init_hctx() method. The following
leak traces are observed after running blktests block/040:
unreferenced object 0xffff8881b82f7400 (size 512):
comm "check", pid 68454, jiffies 4310588881
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace (crc 5bac8b34):
__kvmalloc_node_noprof+0x55d/0x7a0
sbitmap_init_node+0x15a/0x6a0
kyber_init_hctx+0x316/0xb90
blk_mq_init_sched+0x419/0x580
elevator_switch+0x18b/0x630
elv_update_nr_hw_queues+0x219/0x2c0
__blk_mq_update_nr_hw_queues+0x36a/0x6f0
blk_mq_update_nr_hw_queues+0x3a/0x60
0xffffffffc09ceb80
0xffffffffc09d7e0b
configfs_write_iter+0x2b1/0x470
vfs_write+0x527/0xe70
ksys_write+0xff/0x200
do_syscall_64+0x98/0x3c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
unreferenced object 0xffff8881b82f6000 (size 512):
comm "check", pid 68454, jiffies 4310588881
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace (crc 5bac8b34):
__kvmalloc_node_noprof+0x55d/0x7a0
sbitmap_init_node+0x15a/0x6a0
kyber_init_hctx+0x316/0xb90
blk_mq_init_sched+0x419/0x580
elevator_switch+0x18b/0x630
elv_update_nr_hw_queues+0x219/0x2c0
__blk_mq_update_nr_hw_queues+0x36a/0x6f0
blk_mq_update_nr_hw_queues+0x3a/0x60
0xffffffffc09ceb80
0xffffffffc09d7e0b
configfs_write_iter+0x2b1/0x470
vfs_write+0x527/0xe70
ksys_write+0xff/0x200
do_syscall_64+0x98/0x3c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
unreferenced object 0xffff8881b82f5800 (size 512):
comm "check", pid 68454, jiffies 4310588881
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace (crc 5bac8b34):
__kvmalloc_node_noprof+0x55d/0x7a0
sbitmap_init_node+0x15a/0x6a0
kyber_init_hctx+0x316/0xb90
blk_mq_init_sched+0x419/0x580
elevator_switch+0x18b/0x630
elv_update_nr_hw_queues+0x219/0x2c0
__blk_mq_update_nr_hw_queues+0x36a/0x6f0
blk_mq_update_nr_hw_queues+0x3a/0x60
0xffffffffc09ceb80
0xffffffffc09d7e0b
configfs_write_iter+0x2b1/0x470
vfs_write+0x527/0xe70
ksys_write+0xff/0x200
do_syscall_64+0x98/0x3c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
The issue arises while we run nr_hw_queue update, Specifically, we first
reallocate hardware contexts (hctx) via __blk_mq_realloc_hw_ctxs(), and
then later invoke elevator_switch() (assuming q->elevator is not NULL).
The elevator switch code would first exit old elevator (elevator_exit)
and then switches to the new elevator. The elevator_exit loops through
each hctx and invokes the elevator’s per-hctx exit method ->exit_hctx(),
which releases resources allocated during ->init_hctx().
This memleak manifests when we reduce the num of h/w queues - for example,
when the initial update sets the number of queues to X, and a later update
reduces it to Y, where Y < X. In this case, we'd loose the access to old
hctxs while we get to elevator exit code because __blk_mq_realloc_hw_ctxs
would have already released the old hctxs. As we don't now have any
reference left to the old hctxs, we don't have any way to free the
scheduler resources (which are allocate in ->init_hctx()) and kmemleak
complains about it.
This issue was caused due to the commit 596dce110b7d ("block: simplify
elevator reattachment for updating nr_hw_queues"). That change unified
the two-stage elevator teardown and reattachment into a single call that
occurs after __blk_mq_realloc_hw_ctxs() has already freed the hctxs.
This patch restores the previous two-stage elevator switch logic during
nr_hw_queues updates. First, the elevator is switched to 'none', which
ensures all scheduler resources are properly freed. Then, the hardware
contexts (hctxs) are reallocated, and the software-to-hardware queue
mappings are updated. Finally, the original elevator is reattached. This
sequence prevents loss of references to old hctxs and avoids the scheduler
resource leaks reported by kmemleak.
Reported-by : Yi Zhang <yi.zhang@redhat.com>
Fixes: 596dce110b7d ("block: simplify elevator reattachment for updating nr_hw_queues")
Closes: https://lore.kernel.org/all/CAHj4cs8oJFvz=daCvjHM5dYCNQH4UXwSySPPU4v-WHce_kZXZA@mail.gmail.com/
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250724102540.1366308-1-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Testing by the syzbot fuzzer showed that the HID core gets a
shift-out-of-bounds exception when it tries to convert a 32-bit
quantity to a 0-bit quantity. Ideally this should never occur, but
there are buggy devices and some might have a report field with size
set to zero; we shouldn't reject the report or the device just because
of that.
Instead, harden the s32ton() routine so that it returns a reasonable
result instead of crashing when it is called with the number of bits
set to 0 -- the same as what snto32() does.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Reported-by: syzbot+b63d677d63bcac06cf90@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-usb/68753a08.050a0220.33d347.0008.GAE@google.com/
Tested-by: syzbot+b63d677d63bcac06cf90@syzkaller.appspotmail.com
Fixes: dde5845a529f ("[PATCH] Generic HID layer - code split")
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/613a66cd-4309-4bce-a4f7-2905f9bce0c9@rowland.harvard.edu
Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
|
|
Fix kernel-doc formatting issue causing unexpected indentation error
in ctucanfd driver documentation build. Convert main return values
to bullet list format while preserving numbered sub-list in order to
correct indentation error and visual structure in rendered html.
Signed-off-by: Luis Felipe Hernandez <luis.hernandez093@gmail.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Reviewed-by: Pavel Pisa <pisa@fel.cvut.cz>
Link: https://patch.msgid.link/20250722035352.21807-1-luis.hernandez093@gmail.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
Add a dependency on PC104 to limit (restrict) this driver kconfig
prompt to kernel configs that have PC104 set.
Add COMPILE_TEST as a possibility for more complete build coverage.
I tested this build config on x86_64 5 times without problems.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Link: https://patch.msgid.link/20250721002823.3548945-1-rdunlap@infradead.org
[mkl: fix conflict, remove Fixes: tag]
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|