Age | Commit message (Collapse) | Author |
|
When logging an inode in full mode, or when logging xattrs or when logging
the dir index items of a directory, we are modifying the log tree while
holding a read lock on a leaf from the fs/subvolume tree. This can lead to
a deadlock in rare circumstances, but it is a real possibility, and it was
recently reported by syzbot with the following trace from lockdep:
WARNING: possible circular locking dependency detected
6.1.0-rc5-next-20221116-syzkaller #0 Not tainted
------------------------------------------------------
syz-executor.1/16154 is trying to acquire lock:
ffff88807e3084a0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0xa1/0xf30 fs/btrfs/delayed-inode.c:256
but task is already holding lock:
ffff88807df33078 (btrfs-log-00){++++}-{3:3}, at: __btrfs_tree_lock+0x32/0x3d0 fs/btrfs/locking.c:197
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (btrfs-log-00){++++}-{3:3}:
down_read_nested+0x9e/0x450 kernel/locking/rwsem.c:1634
__btrfs_tree_read_lock+0x32/0x350 fs/btrfs/locking.c:135
btrfs_tree_read_lock fs/btrfs/locking.c:141 [inline]
btrfs_read_lock_root_node+0x82/0x3a0 fs/btrfs/locking.c:280
btrfs_search_slot_get_root fs/btrfs/ctree.c:1678 [inline]
btrfs_search_slot+0x3ca/0x2c70 fs/btrfs/ctree.c:1998
btrfs_lookup_csum+0x116/0x3f0 fs/btrfs/file-item.c:209
btrfs_csum_file_blocks+0x40e/0x1370 fs/btrfs/file-item.c:1021
log_csums.isra.0+0x244/0x2d0 fs/btrfs/tree-log.c:4258
copy_items.isra.0+0xbfb/0xed0 fs/btrfs/tree-log.c:4403
copy_inode_items_to_log+0x13d6/0x1d90 fs/btrfs/tree-log.c:5873
btrfs_log_inode+0xb19/0x4680 fs/btrfs/tree-log.c:6495
btrfs_log_inode_parent+0x890/0x2a20 fs/btrfs/tree-log.c:6982
btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7083
btrfs_sync_file+0xa41/0x13c0 fs/btrfs/file.c:1921
vfs_fsync_range+0x13e/0x230 fs/sync.c:188
generic_write_sync include/linux/fs.h:2856 [inline]
iomap_dio_complete+0x73a/0x920 fs/iomap/direct-io.c:128
btrfs_direct_write fs/btrfs/file.c:1536 [inline]
btrfs_do_write_iter+0xba2/0x1470 fs/btrfs/file.c:1668
call_write_iter include/linux/fs.h:2160 [inline]
do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735
do_iter_write+0x182/0x700 fs/read_write.c:861
vfs_iter_write+0x74/0xa0 fs/read_write.c:902
iter_file_splice_write+0x745/0xc90 fs/splice.c:686
do_splice_from fs/splice.c:764 [inline]
direct_splice_actor+0x114/0x180 fs/splice.c:931
splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886
do_splice_direct+0x1ab/0x280 fs/splice.c:974
do_sendfile+0xb19/0x1270 fs/read_write.c:1255
__do_sys_sendfile64 fs/read_write.c:1323 [inline]
__se_sys_sendfile64 fs/read_write.c:1309 [inline]
__x64_sys_sendfile64+0x259/0x2c0 fs/read_write.c:1309
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
-> #1 (btrfs-tree-00){++++}-{3:3}:
__lock_release kernel/locking/lockdep.c:5382 [inline]
lock_release+0x371/0x810 kernel/locking/lockdep.c:5688
up_write+0x2a/0x520 kernel/locking/rwsem.c:1614
btrfs_tree_unlock_rw fs/btrfs/locking.h:189 [inline]
btrfs_unlock_up_safe+0x1e3/0x290 fs/btrfs/locking.c:238
search_leaf fs/btrfs/ctree.c:1832 [inline]
btrfs_search_slot+0x265e/0x2c70 fs/btrfs/ctree.c:2074
btrfs_insert_empty_items+0xbd/0x1c0 fs/btrfs/ctree.c:4133
btrfs_insert_delayed_item+0x826/0xfa0 fs/btrfs/delayed-inode.c:746
btrfs_insert_delayed_items fs/btrfs/delayed-inode.c:824 [inline]
__btrfs_commit_inode_delayed_items fs/btrfs/delayed-inode.c:1111 [inline]
__btrfs_run_delayed_items+0x280/0x590 fs/btrfs/delayed-inode.c:1153
flush_space+0x147/0xe90 fs/btrfs/space-info.c:728
btrfs_async_reclaim_metadata_space+0x541/0xc10 fs/btrfs/space-info.c:1086
process_one_work+0x9bf/0x1710 kernel/workqueue.c:2289
worker_thread+0x669/0x1090 kernel/workqueue.c:2436
kthread+0x2e8/0x3a0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
-> #0 (&delayed_node->mutex){+.+.}-{3:3}:
check_prev_add kernel/locking/lockdep.c:3097 [inline]
check_prevs_add kernel/locking/lockdep.c:3216 [inline]
validate_chain kernel/locking/lockdep.c:3831 [inline]
__lock_acquire+0x2a43/0x56d0 kernel/locking/lockdep.c:5055
lock_acquire kernel/locking/lockdep.c:5668 [inline]
lock_acquire+0x1e3/0x630 kernel/locking/lockdep.c:5633
__mutex_lock_common kernel/locking/mutex.c:603 [inline]
__mutex_lock+0x12f/0x1360 kernel/locking/mutex.c:747
__btrfs_release_delayed_node.part.0+0xa1/0xf30 fs/btrfs/delayed-inode.c:256
__btrfs_release_delayed_node fs/btrfs/delayed-inode.c:251 [inline]
btrfs_release_delayed_node fs/btrfs/delayed-inode.c:281 [inline]
btrfs_remove_delayed_node+0x52/0x60 fs/btrfs/delayed-inode.c:1285
btrfs_evict_inode+0x511/0xf30 fs/btrfs/inode.c:5554
evict+0x2ed/0x6b0 fs/inode.c:664
dispose_list+0x117/0x1e0 fs/inode.c:697
prune_icache_sb+0xeb/0x150 fs/inode.c:896
super_cache_scan+0x391/0x590 fs/super.c:106
do_shrink_slab+0x464/0xce0 mm/vmscan.c:843
shrink_slab_memcg mm/vmscan.c:912 [inline]
shrink_slab+0x388/0x660 mm/vmscan.c:991
shrink_node_memcgs mm/vmscan.c:6088 [inline]
shrink_node+0x93d/0x1f30 mm/vmscan.c:6117
shrink_zones mm/vmscan.c:6355 [inline]
do_try_to_free_pages+0x3b4/0x17a0 mm/vmscan.c:6417
try_to_free_mem_cgroup_pages+0x3a4/0xa70 mm/vmscan.c:6732
reclaim_high.constprop.0+0x182/0x230 mm/memcontrol.c:2393
mem_cgroup_handle_over_high+0x190/0x520 mm/memcontrol.c:2578
try_charge_memcg+0xe0c/0x12f0 mm/memcontrol.c:2816
try_charge mm/memcontrol.c:2827 [inline]
charge_memcg+0x90/0x3b0 mm/memcontrol.c:6889
__mem_cgroup_charge+0x2b/0x90 mm/memcontrol.c:6910
mem_cgroup_charge include/linux/memcontrol.h:667 [inline]
__filemap_add_folio+0x615/0xf80 mm/filemap.c:852
filemap_add_folio+0xaf/0x1e0 mm/filemap.c:934
__filemap_get_folio+0x389/0xd80 mm/filemap.c:1976
pagecache_get_page+0x2e/0x280 mm/folio-compat.c:104
find_or_create_page include/linux/pagemap.h:612 [inline]
alloc_extent_buffer+0x2b9/0x1580 fs/btrfs/extent_io.c:4588
btrfs_init_new_buffer fs/btrfs/extent-tree.c:4869 [inline]
btrfs_alloc_tree_block+0x2e1/0x1320 fs/btrfs/extent-tree.c:4988
__btrfs_cow_block+0x3b2/0x1420 fs/btrfs/ctree.c:440
btrfs_cow_block+0x2fa/0x950 fs/btrfs/ctree.c:595
btrfs_search_slot+0x11b0/0x2c70 fs/btrfs/ctree.c:2038
btrfs_update_root+0xdb/0x630 fs/btrfs/root-tree.c:137
update_log_root fs/btrfs/tree-log.c:2841 [inline]
btrfs_sync_log+0xbfb/0x2870 fs/btrfs/tree-log.c:3064
btrfs_sync_file+0xdb9/0x13c0 fs/btrfs/file.c:1947
vfs_fsync_range+0x13e/0x230 fs/sync.c:188
generic_write_sync include/linux/fs.h:2856 [inline]
iomap_dio_complete+0x73a/0x920 fs/iomap/direct-io.c:128
btrfs_direct_write fs/btrfs/file.c:1536 [inline]
btrfs_do_write_iter+0xba2/0x1470 fs/btrfs/file.c:1668
call_write_iter include/linux/fs.h:2160 [inline]
do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735
do_iter_write+0x182/0x700 fs/read_write.c:861
vfs_iter_write+0x74/0xa0 fs/read_write.c:902
iter_file_splice_write+0x745/0xc90 fs/splice.c:686
do_splice_from fs/splice.c:764 [inline]
direct_splice_actor+0x114/0x180 fs/splice.c:931
splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886
do_splice_direct+0x1ab/0x280 fs/splice.c:974
do_sendfile+0xb19/0x1270 fs/read_write.c:1255
__do_sys_sendfile64 fs/read_write.c:1323 [inline]
__se_sys_sendfile64 fs/read_write.c:1309 [inline]
__x64_sys_sendfile64+0x259/0x2c0 fs/read_write.c:1309
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
other info that might help us debug this:
Chain exists of:
&delayed_node->mutex --> btrfs-tree-00 --> btrfs-log-00
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(btrfs-log-00);
lock(btrfs-tree-00);
lock(btrfs-log-00);
lock(&delayed_node->mutex);
Holding a read lock on a leaf from a fs/subvolume tree creates a nasty
lock dependency when we are COWing extent buffers for the log tree and we
have two tasks modifying the log tree, with each one in one of the
following 2 scenarios:
1) Modifying the log tree triggers an extent buffer allocation while
holding a write lock on a parent extent buffer from the log tree.
Allocating the pages for an extent buffer, or the extent buffer
struct, can trigger inode eviction and finally the inode eviction
will trigger a release/remove of a delayed node, which requires
taking the delayed node's mutex;
2) Allocating a metadata extent for a log tree can trigger the async
reclaim thread and make us wait for it to release enough space and
unblock our reservation ticket. The reclaim thread can start flushing
delayed items, and that in turn results in the need to lock delayed
node mutexes and in the need to write lock extent buffers of a
subvolume tree - all this while holding a write lock on the parent
extent buffer in the log tree.
So one task in scenario 1) running in parallel with another task in
scenario 2) could lead to a deadlock, one wanting to lock a delayed node
mutex while having a read lock on a leaf from the subvolume, while the
other is holding the delayed node's mutex and wants to write lock the same
subvolume leaf for flushing delayed items.
Fix this by cloning the leaf of the fs/subvolume tree, release/unlock the
fs/subvolume leaf and use the clone leaf instead.
Reported-by: syzbot+9b7c21f486f5e7f8d029@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/000000000000ccc93c05edc4d8cf@google.com/
CC: stable@vger.kernel.org # 6.0+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Otherwise the kernel memory allocator seems to be unhappy about failing
order 6 allocations for the zones array, that cause 100% reproducible
mount failures in my qemu setup:
[26.078981] mount: page allocation failure: order:6, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
[26.079741] CPU: 0 PID: 2965 Comm: mount Not tainted 6.1.0-rc5+ #185
[26.080181] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[26.080950] Call Trace:
[26.081132] <TASK>
[26.081291] dump_stack_lvl+0x56/0x6f
[26.081554] warn_alloc+0x117/0x140
[26.081808] ? __alloc_pages_direct_compact+0x1b5/0x300
[26.082174] __alloc_pages_slowpath.constprop.0+0xd0e/0xde0
[26.082569] __alloc_pages+0x32a/0x340
[26.082836] __kmalloc_large_node+0x4d/0xa0
[26.083133] ? trace_kmalloc+0x29/0xd0
[26.083399] kmalloc_large+0x14/0x60
[26.083654] btrfs_get_dev_zone_info+0x1b9/0xc00
[26.083980] ? _raw_spin_unlock_irqrestore+0x28/0x50
[26.084328] btrfs_get_dev_zone_info_all_devices+0x54/0x80
[26.084708] open_ctree+0xed4/0x1654
[26.084974] btrfs_mount_root.cold+0x12/0xde
[26.085288] ? lock_is_held_type+0xe2/0x140
[26.085603] legacy_get_tree+0x28/0x50
[26.085876] vfs_get_tree+0x1d/0xb0
[26.086139] vfs_kern_mount.part.0+0x6c/0xb0
[26.086456] btrfs_mount+0x118/0x3a0
[26.086728] ? lock_is_held_type+0xe2/0x140
[26.087043] legacy_get_tree+0x28/0x50
[26.087323] vfs_get_tree+0x1d/0xb0
[26.087587] path_mount+0x2ba/0xbe0
[26.087850] ? _raw_spin_unlock_irqrestore+0x38/0x50
[26.088217] __x64_sys_mount+0xfe/0x140
[26.088506] do_syscall_64+0x35/0x80
[26.088776] entry_SYSCALL_64_after_hwframe+0x63/0xcd
Fixes: 5b316468983d ("btrfs: get zone information of zoned block devices")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If kzalloc() for 'ap_qci_info_old' failed, 'ap_qci_info' shold be
freed before return. Otherwise it is a memory leak.
Link: https://lore.kernel.org/r/20221114110830.542246-1-weiyongjun@huaweicloud.com
Fixes: 283915850a44 ("s390/ap: notify drivers on config changed and scan complete callbacks")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
root cause that S2A need to use deduct offset flag.
after setting this flag, vcn0 doorbell value works.
so return it as before
Signed-off-by: Jane Jian <Jane.Jian@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
when the edid is read"
This partially reverts 20543be93ca45968f344261c1a997177e51bd7e1.
Calling drm_connector_update_edid_property() in
amdgpu_connector_free_edid() causes a noticeable pause in
the system every 10 seconds on polled outputs so revert this
part of the change.
Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/2257
Cc: Claudio Suarez <cssk@net-c.es>
Acked-by: Luben Tuikov <luben.tuikov@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
|
|
[why]
First MST sideband message returns AUX_RET_ERROR_HPD_DISCON
on certain intel platform. Aux transaction considered failure
if HPD unexpected pulled low. The actual aux transaction success
in such case, hence do not return error.
[how]
Not returning error when AUX_RET_ERROR_HPD_DISCON detected
on the first sideband message.
v2: squash in fix (Alex)
Reviewed-by: Jerry Zuo <Jerry.Zuo@amd.com>
Acked-by: Brian Chang <Brian.Chang@amd.com>
Signed-off-by: Tsung-hua Lin <Tsung-hua.Lin@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
|
|
[Why]
[ 754.862560] refcount_t: underflow; use-after-free.
[ 754.862898] Call Trace:
[ 754.862903] <TASK>
[ 754.862913] amdgpu_job_free_cb+0xc2/0xe1 [amdgpu]
[ 754.863543] drm_sched_main.cold+0x34/0x39 [amd_sched]
[How]
The fw_fence may be not init, check whether dma_fence_init
is performed before job free
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
update driver if header for smu_13_0_7
Signed-off-by: lyndonli <Lyndon.Li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 6.0.x
|
|
[Why]
Underflow is observed when cursor is still enabled when the cursor
rectangle is outside the bounds of it's surface viewport.
[How]
Update parameters used to determine when cursor should be disabled.
Reviewed-by: Martin Leung <Martin.Leung@amd.com>
Acked-by: Brian Chang <Brian.Chang@amd.com>
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
[WHY?]
When calculating watermark and dlg values, the max mclk level index and
associated speed are needed to find the correlated dummy latency value.
Currently the incorrect index is given due to a clock manager refactor.
[HOW?]
Use num_memclk_level from num_entries_per_clk struct for getting the correct max
mem speed.
Reviewed-by: Jun Lei <Jun.Lei@amd.com>
Acked-by: Brian Chang <Brian.Chang@amd.com>
Signed-off-by: Dillon Varone <Dillon.Varone@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
[Why]
Pixel rate divider values should never be set to N/A (0xF) as the K1/K2
field is only 1/2 bits wide.
[How]
Set valid divider values for virtual and FRL/DP2 cases.
Reviewed-by: Nicholas Kazlauskas <Nicholas.Kazlauskas@amd.com>
Acked-by: Brian Chang <Brian.Chang@amd.com>
Signed-off-by: Taimur Hassan <Syed.Hassan@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
[WHY?]
MALL allocation size depends on the viewport height, not the addressable
vertical lines, which will not match when scaling.
[HOW?]
Base MALL allocation size calculations off viewport height.
Reviewed-by: Alvin Lee <Alvin.Lee2@amd.com>
Reviewed-by: Martin Leung <Martin.Leung@amd.com>
Acked-by: Brian Chang <Brian.Chang@amd.com>
Signed-off-by: Dillon Varone <Dillon.Varone@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
[Description]
New values for soc bounding box and dummy pstate.
Reviewed-by: Jun Lei <Jun.Lei@amd.com>
Acked-by: Brian Chang <Brian.Chang@amd.com>
Signed-off-by: Dillon Varone <Dillon.Varone@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 6.0.x
|
|
There's been a very long running bug that seems to have been neglected for
a while, where amdgpu consistently triggers a KASAN error at start:
BUG: KASAN: global-out-of-bounds in read_indirect_azalia_reg+0x1d4/0x2a0 [amdgpu]
Read of size 4 at addr ffffffffc2274b28 by task modprobe/1889
After digging through amd's rather creative method for accessing registers,
I eventually discovered the problem likely has to do with the fact that on
my dce120 GPU there are supposedly 7 sets of audio registers. But we only
define a register mapping for 6 sets.
So, fix this and fix the KASAN warning finally.
Signed-off-by: Lyude Paul <lyude@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
We can reuse the same buffers on resume.
v2: squash in S4 fix from Shikai
Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/2213
Reviewed-by: Christian König <christian.koenig@amd.com>
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
|
|
ZLP for DisplayLink ethernet devices was enabled in 6.0:
266c0190aee3 ("net/cdc_ncm: Enable ZLP for DisplayLink ethernet devices").
The related driver_info should be the "same as cdc_ncm_info, but with
FLAG_SEND_ZLP". However, set_rx_mode that enables handling multicast
traffic was missing in the new cdc_ncm_zlp_info.
usbnet_cdc_update_filter rx mode was introduced in linux 5.9 with:
e10dcb1b6ba7 ("net: cdc_ncm: hook into set_rx_mode to admit multicast
traffic")
Without this hook, multicast, and then IPv6 SLAAC, is broken.
Fixes: 266c0190aee3 ("net/cdc_ncm: Enable ZLP for DisplayLink ethernet devices")
Signed-off-by: Santiago Ruano Rincón <santiago.ruano-rincon@imt-atlantique.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add RmNet support for LARA-L6.
LARA-L6 module can be configured (by AT interface) in three different
USB modes:
* Default mode (Vendor ID: 0x1546 Product ID: 0x1341) with 4 serial
interfaces
* RmNet mode (Vendor ID: 0x1546 Product ID: 0x1342) with 4 serial
interfaces and 1 RmNet virtual network interface
* CDC-ECM mode (Vendor ID: 0x1546 Product ID: 0x1343) with 4 serial
interface and 1 CDC-ECM virtual network interface
In RmNet mode LARA-L6 exposes the following interfaces:
If 0: Diagnostic
If 1: AT parser
If 2: AT parser
If 3: AT parset/alternative functions
If 4: RMNET interface
Signed-off-by: Davide Tronchin <davide.tronchin.94@gmail.com>
Acked-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When holding a reader-writer spin lock we cannot sleep. Calling
setup_udp_tunnel_sock() with write lock held violates this rule, because we
end up calling percpu_down_read(), which might sleep, as syzbot reports
[1]:
__might_resched.cold+0x222/0x26b kernel/sched/core.c:9890
percpu_down_read include/linux/percpu-rwsem.h:49 [inline]
cpus_read_lock+0x1b/0x140 kernel/cpu.c:310
static_key_slow_inc+0x12/0x20 kernel/jump_label.c:158
udp_tunnel_encap_enable include/net/udp_tunnel.h:187 [inline]
setup_udp_tunnel_sock+0x43d/0x550 net/ipv4/udp_tunnel_core.c:81
l2tp_tunnel_register+0xc51/0x1210 net/l2tp/l2tp_core.c:1509
pppol2tp_connect+0xcdc/0x1a10 net/l2tp/l2tp_ppp.c:723
Trim the writer-side critical section for sk_callback_lock down to the
minimum, so that it covers only operations on sk_user_data.
Also, when grabbing the sk_callback_lock, we always need to disable BH, as
Eric points out. Failing to do so leads to deadlocks because we acquire
sk_callback_lock in softirq context, which can get stuck waiting on us if:
1) it runs on the same CPU, or
CPU0
----
lock(clock-AF_INET6);
<Interrupt>
lock(clock-AF_INET6);
2) lock ordering leads to priority inversion
CPU0 CPU1
---- ----
lock(clock-AF_INET6);
local_irq_disable();
lock(&tcp_hashinfo.bhash[i].lock);
lock(clock-AF_INET6);
<Interrupt>
lock(&tcp_hashinfo.bhash[i].lock);
... as syzbot reports [2,3]. Use the _bh variants for write_(un)lock.
[1] https://lore.kernel.org/netdev/0000000000004e78ec05eda79749@google.com/
[2] https://lore.kernel.org/netdev/000000000000e38b6605eda76f98@google.com/
[3] https://lore.kernel.org/netdev/000000000000dfa31e05eda76f75@google.com/
v2:
- Check and set sk_user_data while holding sk_callback_lock for both
L2TP encapsulation types (IP and UDP) (Tetsuo)
Cc: Tom Parkin <tparkin@katalix.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Fixes: b68777d54fac ("l2tp: Serialize access to sk_user_data with sk_callback_lock")
Reported-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot+703d9e154b3b58277261@syzkaller.appspotmail.com
Reported-by: syzbot+50680ced9e98a61f7698@syzkaller.appspotmail.com
Reported-by: syzbot+de987172bb74a381879b@syzkaller.appspotmail.com
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The dm9051_loop_rx() returns without release skb when dm9051_stop_mrcmd()
returns error, free the skb to avoid this leak.
Fixes: 2dc95a4d30ed ("net: Add dm9051 driver")
Signed-off-by: Yuan Can <yuancan@huawei.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In com20020_probe(), if com20020_config() fails, dev and info
will not be freed, which will lead to a memory leak.
This patch adds freeing dev and info after com20020_config()
fails to fix this bug.
Compile tested only.
Fixes: 15b99ac17295 ("[PATCH] pcmcia: add return value to _config() functions")
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip into arm/fixes
Fixes to make the automated binding tools happier (node-names,
undocumented + unneeded properties) and fixes for non-working
devices on some boards.
* tag 'v6.1-rockchip-dtsfixes1' of git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip:
arm64: dts: rockchip: Fix Pine64 Quartz4-B PMIC interrupt
ARM: dts: rockchip: rk3188: fix lcdc1-rgb24 node name
arm64: dts: rockchip: fix ir-receiver node names
ARM: dts: rockchip: fix ir-receiver node names
arm64: dts: rockchip: fix adc-keys sub node names
ARM: dts: rockchip: fix adc-keys sub node names
arm: dts: rockchip: remove clock-frequency from rtc
arm: dts: rockchip: fix node name for hym8563 rtc
arm64: dts: rockchip: remove clock-frequency from rtc
arm64: dts: rockchip: fix node name for hym8563 rtc
arm64: dts: rockchip: lower rk3399-puma-haikou SD controller clock frequency
arm64: dts: rockchip: keep I2S1 disabled for GPIO function on ROCK Pi 4 series
arm64: dts: rockchip: fix quartz64-a bluetooth configuration
arm64: dts: rockchip: add enable-strobe-pulldown to emmc phy on nanopi4
arm64: dts: rockchip: remove i2c5 from rk3566-roc-pc
arm64: dts: rockchip: Fix i2c3 pinctrl on rk3566-roc-pc
arm64: dts: rockchip: Fix gmac failure of rgmii-id from rk3566-roc-pc
arm64: dts: rockchip: Drop RK3399-Scarlet's repeated ec_ap_int_l definition
Link: https://lore.kernel.org/r/6274427.GXAFRqVoOG@phil
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
trans_xen did not check the data fits into the buffer before copying
from the xen ring, but we probably should.
Add a check that just skips the request and return an error to
userspace if it did not fit
Tested-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Link: https://lkml.kernel.org/r/20221118135542.63400-1-asmadeus@codewreck.org
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5 fixes 2022-11-21
This series provides bug fixes to mlx5 driver.
* tag 'mlx5-fixes-2022-11-21' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5e: Fix possible race condition in macsec extended packet number update routine
net/mlx5e: Fix MACsec update SecY
net/mlx5e: Fix MACsec SA initialization routine
net/mlx5e: Remove leftovers from old XSK queues enumeration
net/mlx5e: Offload rule only when all encaps are valid
net/mlx5e: Fix missing alignment in size of MTT/KLM entries
net/mlx5: Fix sync reset event handler error flow
net/mlx5: E-Switch, Set correctly vport destination
net/mlx5: Lag, avoid lockdep warnings
net/mlx5: Fix handling of entry refcount when command is not issued to FW
net/mlx5: cmdif, Print info on any firmware cmd failure to tracepoint
net/mlx5: SF: Fix probing active SFs during driver probe phase
net/mlx5: Fix FW tracer timestamp calculation
net/mlx5: Do not query pci info while pci disabled
====================
Link: https://lore.kernel.org/r/20221122022559.89459-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In fib_table_insert(), if the alias was already inserted, but node not
exist, the error code should be set before return from error handling path.
Fixes: a6c76c17df02 ("ipv4: Notify route after insertion to the routing table")
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Link: https://lore.kernel.org/r/20221120072838.2167047-1-william.xuanziyang@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Yan Cangang says:
====================
net: ethernet: mtk_eth_soc: fix memory leak in error path
====================
Link: https://lore.kernel.org/r/20221120055259.224555-1-nalanzeyu@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In mtk_ppe_init(), when dmam_alloc_coherent() or devm_kzalloc() failed,
the rhashtable ppe->l2_flows isn't destroyed. Fix it.
In mtk_probe(), when mtk_ppe_init() or mtk_eth_offload_init() or
register_netdev() failed, have the same problem. Fix it.
Fixes: 33fc42de3327 ("net: ethernet: mtk_eth_soc: support creating mac address based offload entries")
Signed-off-by: Yan Cangang <nalanzeyu@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In mtk_probe(), when mtk_ppe_init() or mtk_eth_offload_init() failed,
mtk_mdio_cleanup() isn't called. Fix it.
Fixes: ba37b7caf1ed ("net: ethernet: mtk_eth_soc: add support for initializing the PPE")
Fixes: 502e84e2382d ("net: ethernet: mtk_eth_soc: add flow offloading support")
Signed-off-by: Yan Cangang <nalanzeyu@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When fail to dma_map_single() in mtk_rx_alloc(), it returns directly.
But the memory allocated for local variable data is not freed, and
local variabel data has not been attached to ring->data[i] yet, so the
memory allocated for local variable data will not be freed outside
mtk_rx_alloc() too. Thus memory leak would occur in this scenario.
Add skb_free_frag(data) when dma_map_single() failed.
Fixes: 23233e577ef9 ("net: ethernet: mtk_eth_soc: rely on page_pool for single page buffers")
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Acked-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Link: https://lore.kernel.org/r/20221120035405.1464341-1-william.xuanziyang@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
'dccp-tcp-fix-bhash2-issues-related-to-warn_on-in-inet_csk_get_port'
Kuniyuki Iwashima says:
====================
dccp/tcp: Fix bhash2 issues related to WARN_ON() in inet_csk_get_port().
syzkaller was hitting a WARN_ON() in inet_csk_get_port() in the 4th patch,
which was because we forgot to fix up bhash2 bucket when connect() for a
socket bound to a wildcard address fails in __inet_stream_connect().
There was a similar report [0], but its repro does not fire the WARN_ON() due
to inconsistent error handling.
When connect() for a socket bound to a wildcard address fails, saddr may or
may not be reset depending on where the failure happens. When we fail in
__inet_stream_connect(), sk->sk_prot->disconnect() resets saddr. OTOH, in
(dccp|tcp)_v[46]_connect(), if we fail after inet_hash6?_connect(), we
forget to reset saddr.
We fix this inconsistent error handling in the 1st patch, and then we'll
fix the bhash2 WARN_ON() issue.
Note that there is still an issue in that we reset saddr without checking
if there are conflicting sockets in bhash and bhash2, but this should be
another series.
See [1][2] for the previous discussion.
[0]: https://lore.kernel.org/netdev/0000000000003f33bc05dfaf44fe@google.com/
[1]: https://lore.kernel.org/netdev/20221029001249.86337-1-kuniyu@amazon.com/
[2]: https://lore.kernel.org/netdev/20221103172419.20977-1-kuniyu@amazon.com/
[3]: https://lore.kernel.org/netdev/20221118081906.053d5231@kernel.org/T/#m00aafedb29ff0b55d5e67aef0252ef1baaf4b6ee
====================
Link: https://lore.kernel.org/r/20221119014914.31792-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
If a socket bound to a wildcard address fails to connect(), we
only reset saddr and keep the port. Then, we have to fix up the
bhash2 bucket; otherwise, the bucket has an inconsistent address
in the list.
Also, listen() for such a socket will fire the WARN_ON() in
inet_csk_get_port(). [0]
Note that when a system runs out of memory, we give up fixing the
bucket and unlink sk from bhash and bhash2 by inet_put_port().
[0]:
WARNING: CPU: 0 PID: 207 at net/ipv4/inet_connection_sock.c:548 inet_csk_get_port (net/ipv4/inet_connection_sock.c:548 (discriminator 1))
Modules linked in:
CPU: 0 PID: 207 Comm: bhash2_prev_rep Not tainted 6.1.0-rc3-00799-gc8421681c845 #63
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.amzn2022.0.1 04/01/2014
RIP: 0010:inet_csk_get_port (net/ipv4/inet_connection_sock.c:548 (discriminator 1))
Code: 74 a7 eb 93 48 8b 54 24 18 0f b7 cb 4c 89 e6 4c 89 ff e8 48 b2 ff ff 49 8b 87 18 04 00 00 e9 32 ff ff ff 0f 0b e9 34 ff ff ff <0f> 0b e9 42 ff ff ff 41 8b 7f 50 41 8b 4f 54 89 fe 81 f6 00 00 ff
RSP: 0018:ffffc900003d7e50 EFLAGS: 00010202
RAX: ffff8881047fb500 RBX: 0000000000004e20 RCX: 0000000000000000
RDX: 000000000000000a RSI: 00000000fffffe00 RDI: 00000000ffffffff
RBP: ffffffff8324dc00 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
R13: 0000000000000001 R14: 0000000000004e20 R15: ffff8881054e1280
FS: 00007f8ac04dc740(0000) GS:ffff88842fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020001540 CR3: 00000001055fa003 CR4: 0000000000770ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
inet_csk_listen_start (net/ipv4/inet_connection_sock.c:1205)
inet_listen (net/ipv4/af_inet.c:228)
__sys_listen (net/socket.c:1810)
__x64_sys_listen (net/socket.c:1819 net/socket.c:1817 net/socket.c:1817)
do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
RIP: 0033:0x7f8ac051de5d
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
RSP: 002b:00007ffc1c177248 EFLAGS: 00000206 ORIG_RAX: 0000000000000032
RAX: ffffffffffffffda RBX: 0000000020001550 RCX: 00007f8ac051de5d
RDX: ffffffffffffff80 RSI: 0000000000000000 RDI: 0000000000000004
RBP: 00007ffc1c177270 R08: 0000000000000018 R09: 0000000000000007
R10: 0000000020001540 R11: 0000000000000206 R12: 00007ffc1c177388
R13: 0000000000401169 R14: 0000000000403e18 R15: 00007f8ac0723000
</TASK>
Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address")
Reported-by: syzbot <syzkaller@googlegroups.com>
Reported-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When we call connect() for a socket bound to a wildcard address, we update
saddr locklessly. However, it could result in a data race; another thread
iterating over bhash might see a corrupted address.
Let's update saddr under the bhash bucket's lock.
Fixes: 3df80d9320bc ("[DCCP]: Introduce DCCPv6")
Fixes: 7c657876b63c ("[DCCP]: Initial implementation")
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When we call inet_bhash2_update_saddr(), prev_saddr is always non-NULL.
Let's remove the unnecessary test.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When connect() is called on a socket bound to the wildcard address,
we change the socket's saddr to a local address. If the socket
fails to connect() to the destination, we have to reset the saddr.
However, when an error occurs after inet_hash6?_connect() in
(dccp|tcp)_v[46]_conect(), we forget to reset saddr and leave
the socket bound to the address.
From the user's point of view, whether saddr is reset or not varies
with errno. Let's fix this inconsistent behaviour.
Note that after this patch, the repro [0] will trigger the WARN_ON()
in inet_csk_get_port() again, but this patch is not buggy and rather
fixes a bug papering over the bhash2's bug for which we need another
fix.
For the record, the repro causes -EADDRNOTAVAIL in inet_hash6_connect()
by this sequence:
s1 = socket()
s1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
s1.bind(('127.0.0.1', 10000))
s1.sendto(b'hello', MSG_FASTOPEN, (('127.0.0.1', 10000)))
# or s1.connect(('127.0.0.1', 10000))
s2 = socket()
s2.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
s2.bind(('0.0.0.0', 10000))
s2.connect(('127.0.0.1', 10000)) # -EADDRNOTAVAIL
s2.listen(32) # WARN_ON(inet_csk(sk)->icsk_bind2_hash != tb2);
[0]: https://syzkaller.appspot.com/bug?extid=015d756bbd1f8b5c8f09
Fixes: 3df80d9320bc ("[DCCP]: Introduce DCCPv6")
Fixes: 7c657876b63c ("[DCCP]: Initial implementation")
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
If KPROBES_SANITY_TEST and ARCH_CORRECT_STACKTRACE_ON_KRETPROBE is enabled, but
STACKTRACE is not set. Build failed as below:
lib/test_kprobes.c: In function `stacktrace_return_handler':
lib/test_kprobes.c:228:8: error: implicit declaration of function `stack_trace_save'; did you mean `stacktrace_driver'? [-Werror=implicit-function-declaration]
ret = stack_trace_save(stack_buf, STACK_BUF_SIZE, 0);
^~~~~~~~~~~~~~~~
stacktrace_driver
cc1: all warnings being treated as errors
scripts/Makefile.build:250: recipe for target 'lib/test_kprobes.o' failed
make[2]: *** [lib/test_kprobes.o] Error 1
To fix this error, Select STACKTRACE if ARCH_CORRECT_STACKTRACE_ON_KRETPROBE is enabled.
Link: https://lkml.kernel.org/r/20221121030620.63181-1-hucool.lihua@huawei.com
Fixes: 1f6d3a8f5e39 ("kprobes: Add a test case for stacktrace from kretprobe handler")
Signed-off-by: Li Hua <hucool.lihua@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When extending segments, nilfs_sufile_alloc() is called to get an
unassigned segment, then mark it as dirty to avoid accidentally allocating
the same segment in the future.
But for some special cases such as a corrupted image it can be unreliable.
If such corruption of the dirty state of the segment occurs, nilfs2 may
reallocate a segment that is in use and pick the same segment for writing
twice at the same time.
This will cause the problem reported by syzkaller:
https://syzkaller.appspot.com/bug?id=c7c4748e11ffcc367cef04f76e02e931833cbd24
This case started with segbuf1.segnum = 3, nextnum = 4 when constructed.
It supposed segment 4 has already been allocated and marked as dirty.
However the dirty state was corrupted and segment 4 usage was not dirty.
For the first time nilfs_segctor_extend_segments() segment 4 was allocated
again, which made segbuf2 and next segbuf3 had same segment 4.
sb_getblk() will get same bh for segbuf2 and segbuf3, and this bh is added
to both buffer lists of two segbuf. It makes the lists broken which
causes NULL pointer dereference.
Fix the problem by setting usage as dirty every time in
nilfs_sufile_mark_dirty(), which is called during constructing current
segment to be written out and before allocating next segment.
[chenzhongjin@huawei.com: add lock protection per Ryusuke]
Link: https://lkml.kernel.org/r/20221121091141.214703-1-chenzhongjin@huawei.com
Link: https://lkml.kernel.org/r/20221118063304.140187-1-chenzhongjin@huawei.com
Fixes: 9ff05123e3bf ("nilfs2: segment constructor")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reported-by: <syzbot+77e4f0...@syzkaller.appspotmail.com>
Reported-by: Liu Shixin <liushixin2@huawei.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
balance_dirty_pages doesn't do the required dirty throttling on cgroupv1.
See commit 9badce000e2c ("cgroup, writeback: don't enable cgroup writeback
on traditional hierarchies"). Instead, the kernel depends on writeback
throttling in shrink_folio_list to achieve the same goal. With large
memory systems, the flusher may not be able to writeback quickly enough
such that we will start finding pages in the shrink_folio_list already in
writeback. Hence for cgroupv1 let's do a reclaim throttle after waking up
the flusher.
The below test which used to fail on a 256GB system completes till the the
file system is full with this change.
root@lp2:/sys/fs/cgroup/memory# mkdir test
root@lp2:/sys/fs/cgroup/memory# cd test/
root@lp2:/sys/fs/cgroup/memory/test# echo 120M > memory.limit_in_bytes
root@lp2:/sys/fs/cgroup/memory/test# echo $$ > tasks
root@lp2:/sys/fs/cgroup/memory/test# dd if=/dev/zero of=/home/kvaneesh/test bs=1M
Killed
Link: https://lkml.kernel.org/r/20221118070603.84081-1-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: zefan li <lizefan.x@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When we specify __GFP_NOWARN, we only expect that no warnings will be
issued for current caller. But in the __should_failslab() and
__should_fail_alloc_page(), the local GFP flags alter the global
{failslab|fail_page_alloc}.attr, which is persistent and shared by all
tasks. This is not what we expected, let's fix it.
[akpm@linux-foundation.org: unexport should_fail_ex()]
Link: https://lkml.kernel.org/r/20221118100011.2634-1-zhengqi.arch@bytedance.com
Fixes: 3f913fc5f974 ("mm: fix missing handler for __GFP_NOWARN")
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
A softlockup occurs in scan free swap slot under huge memory pressure.
The test scenario is: 64 CPU cores, 64GB memory, and 28 zram devices, the
disksize of each zram device is 50MB.
LATENCY_LIMIT is used to prevent softlockups in scan_swap_map_slots(), but
the real loop number would more than LATENCY_LIMIT because of "goto checks
and goto scan" repeatly without decreasing latency limit.
In order to fix it, decrease latency_ration in advance.
There is also a suspicious place that will cause softlockups in
get_swap_pages(). In this function, the "goto start_over" may result in
continuous scanning of the swap partition. If there is no cond_sched in
scan_swap_map_slots(), it would cause a softlockup (I am not sure about
this).
WARN: soft lockup - CPU#11 stuck for 11s! [kswapd0:466]
CPU: 11 PID: 466 Comm: kswapd@ Kdump: loaded Tainted: G
dump backtrace+0x0/0x1le4
show stack+0x20/@x2c
dump_stack+0xd8/0x140
watchdog print_info+0x48/0x54
watchdog_process_before_softlockup+0x98/0xa0
watchdog_timer_fn+0xlac/0x2d0
hrtimer_rum_queues+0xb0/0x130
hrtimer_interrupt+0x13c/0x3c0
arch_timer_handler_virt+0x3c/0x50
handLe_percpu_devid_irq+0x90/0x1f4
handle domain irq+0x84/0x100
gic_handle_irq+0x88/0x2b0
e11 ira+0xhB/Bx140
scan_swap_map_slots+0x678/0x890
get_swap_pages+0x29c/0x440
get_swap_page+0x120/0x2e0
add_to_swap+UX2U/0XyC
shrink_page_list+0x5d0/0x152c
shrink_inactive_list+0xl6c/Bx500
shrink_lruvec+0x270/0x304
WARN: soft lockup - CPU#32 stuck for 11s! [stress-ng:309915]
watchdog_timer_fn+0x1ac/0x2d0
__run_hrtimer+0x98/0x2a0
__hrtimer_run_queues+0xb0/0x130
hrtimer_interrupt+0x13c/0x3c0
arch_timer_handler_virt+0x3c/0x50
handle_percpu_devid_irq+0x90/0x1f4
__handle_domain_irq+0x84/0x100
gic_handle_irq+0x88/0x2b0
el1_irq+0xb8/0x140
get_swap_pages+0x1e8/0x440
get_swap_page+0x1c8/0x2e0
add_to_swap+0x20/0x9c
shrink_page_list+0x5d0/0x152c
reclaim_pages+0x160/0x310
madvise_cold_or_pageout_pte_range+0x7bc/0xe3c
walk_pmd_range.isra.0+0xac/0x22c
walk_pud_range+0xfc/0x1c0
walk_pgd_range+0x158/0x1b0
__walk_page_range+0x64/0x100
walk_page_range+0x104/0x150
Link: https://lkml.kernel.org/r/20221118133850.3360369-1-chenwandun@huawei.com
Fixes: 048c27fd7281 ("[PATCH] swap: scan_swap_map latency breaks")
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Cc: <xialonglong1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit 2b21624fc232 ("hugetlb: freeze allocated pages before creating
hugetlb pages") changed the order page flags were cleared and set in the
head page. It moved the __ClearPageReserved after __SetPageHead.
However, there is a check to make sure __ClearPageReserved is never done
on a head page. If CONFIG_DEBUG_VM_PGFLAGS is enabled, the following BUG
will be hit when creating a hugetlb gigantic page:
page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page))
------------[ cut here ]------------
kernel BUG at include/linux/page-flags.h:500!
Call Trace will differ depending on whether hugetlb page is created
at boot time or run time.
Make sure to __ClearPageReserved BEFORE __SetPageHead.
Link: https://lkml.kernel.org/r/20221118195249.178319-1-mike.kravetz@oracle.com
Fixes: 2b21624fc232 ("hugetlb: freeze allocated pages before creating hugetlb pages")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Tested-by: Tarun Sahu <tsahu@linux.ibm.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit b14051352465 ("mm/sl[au]b: generalize kmalloc subsystem")
refactored large parts of the kmalloc subsystem, resulting in the stack
trace pruning logic done by KFENCE to no longer work.
While b14051352465 attempted to fix the situation by including
'__kmem_cache_free' in the list of functions KFENCE should skip through,
this only works when the compiler actually optimized the tail call from
kfree() to __kmem_cache_free() into a jump (and thus kfree() _not_
appearing in the full stack trace to begin with).
In some configurations, the compiler no longer optimizes the tail call
into a jump, and __kmem_cache_free() appears in the stack trace. This
means that the pruned stack trace shown by KFENCE would include kfree()
which is not intended - for example:
| BUG: KFENCE: invalid free in kfree+0x7c/0x120
|
| Invalid free of 0xffff8883ed8fefe0 (in kfence-#126):
| kfree+0x7c/0x120
| test_double_free+0x116/0x1a9
| kunit_try_run_case+0x90/0xd0
| [...]
Fix it by moving __kmem_cache_free() to the list of functions that may be
tail called by an allocator entry function, making the pruning logic work
in both the optimized and unoptimized tail call cases.
Link: https://lkml.kernel.org/r/20221118152216.3914899-1-elver@google.com
Fixes: b14051352465 ("mm/sl[au]b: generalize kmalloc subsystem")
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Feng Tang <feng.tang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
SecPageTables has a tab after it instead of a space, this can break
fragile parsers that depend on spaces after the stat names.
Link: https://lkml.kernel.org/r/20221117043247.133294-1-yosryahmed@google.com
Fixes: ebc97a52b5d6cd5f ("mm: add NR_SECONDARY_PAGETABLE to count secondary page table uses.")
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The page reclaim isolates a batch of folios from the tail of one of the
LRU lists and works on those folios one by one. For a suitable
swap-backed folio, if the swap device is async, it queues that folio for
writeback. After the page reclaim finishes an entire batch, it puts back
the folios it queued for writeback to the head of the original LRU list.
In the meantime, the page writeback flushes the queued folios also by
batches. Its batching logic is independent from that of the page reclaim.
For each of the folios it writes back, the page writeback calls
folio_rotate_reclaimable() which tries to rotate a folio to the tail.
folio_rotate_reclaimable() only works for a folio after the page reclaim
has put it back. If an async swap device is fast enough, the page
writeback can finish with that folio while the page reclaim is still
working on the rest of the batch containing it. In this case, that folio
will remain at the head and the page reclaim will not retry it before
reaching there.
This patch adds a retry to evict_folios(). After evict_folios() has
finished an entire batch and before it puts back folios it cannot free
immediately, it retries those that may have missed the rotation.
Before this patch, ~60% of folios swapped to an Intel Optane missed
folio_rotate_reclaimable(). After this patch, ~99% of missed folios were
reclaimed upon retry.
This problem affects relatively slow async swap devices like Samsung 980
Pro much less and does not affect sync swap devices like zram or zswap at
all.
Link: https://lkml.kernel.org/r/20221116013808.3995280-1-yuzhao@google.com
Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation")
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: "Yin, Fengwei" <fengwei.yin@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add and also update email address, skakit@codeaurora.org is no longer
active.
Link: https://lkml.kernel.org/r/20221116105017.3018971-1-quic_c_skakit@quicinc.com
Signed-off-by: Satya Priya <quic_c_skakit@quicinc.com>
Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
migrate_vma->cpages originally contained a count of the number of pages
migrating including non-present pages which can be populated directly on
the target.
Commit 241f68859656 ("mm/migrate_device.c: refactor migrate_vma and
migrate_device_coherent_page()") inadvertantly changed this to contain
just the number of pages that were unmapped. Usage of migrate_vma->cpages
isn't documented, but most drivers use it to see if all the requested
addresses can be migrated so restore the original behaviour.
Link: https://lkml.kernel.org/r/20221111005135.1344004-1-apopple@nvidia.com
Fixes: 241f68859656 ("mm/migrate_device.c: refactor migrate_vma and migrate_deivce_coherent_page()")
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reported-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add missing <linux/string.h> include for strcmp.
Clang 16 makes -Wimplicit-function-declaration an error by default.
Unfortunately, out of tree modules may use this in configure scripts,
which means failure might cause silent miscompilation or misconfiguration.
For more information, see LWN.net [0] or LLVM's Discourse [1], gentoo-dev@ [2],
or the (new) c-std-porting mailing list [3].
[0] https://lwn.net/Articles/913505/
[1] https://discourse.llvm.org/t/configure-script-breakage-with-the-new-werror-implicit-function-declaration/65213
[2] https://archives.gentoo.org/gentoo-dev/message/dd9f2d3082b8b6f8dfbccb0639e6e240
[3] hosted at lists.linux.dev.
[akpm@linux-foundation.org: remember "linux/"]
Link: https://lkml.kernel.org/r/20221116182634.2823136-1-sam@gentoo.org
Signed-off-by: Sam James <sam@gentoo.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Use my personal email address.
Link: https://lkml.kernel.org/r/20221114001302.671897-2-alex.hung@amd.com
Signed-off-by: Alex Hung <alexhung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
I am no longer at Canonical and add entry of my personal email address.
Link: https://lkml.kernel.org/r/20221114001302.671897-1-alex.hung@amd.com
Signed-off-by: Alex Hung <alexhung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When the struct_mm input, mm, was changed to a struct ma_state, mas, the
documentation for the function was never updated. This updates that
documentation reference.
Link: https://lkml.kernel.org/r/20221114003349.41235-1-ian@linux.cowan.aero
Signed-off-by: Ian Cowan <ian@linux.cowan.aero>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
A DAMON sysfs interface user can start DAMON with a scheme, remove the
sysfs directory for the scheme, and then ask update of the scheme's stats.
Because the schemes stats update logic isn't aware of the situation, it
results in an invalid memory access. Fix the bug by checking if the
scheme sysfs directory exists.
Link: https://lkml.kernel.org/r/20221114175552.1951-1-sj@kernel.org
Fixes: 0ac32b8affb5 ("mm/damon/sysfs: support DAMOS stats")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> [v5.18]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The migrate_to_ram() callback should always succeed, but in rare cases can
fail usually returning VM_FAULT_SIGBUS. Commit 16ce101db85d
("mm/memory.c: fix race when faulting a device private page") incorrectly
stopped passing the return code up the stack. Fix this by setting the ret
variable, restoring the previous behaviour on migrate_to_ram() failure.
Link: https://lkml.kernel.org/r/20221114115537.727371-1-apopple@nvidia.com
Fixes: 16ce101db85d ("mm/memory.c: fix race when faulting a device private page")
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|