Age Commit message [Author]
2022-11-24 net/mlx5e: MACsec, fix memory leak when MACsec device is deleted [Raed Salem]
When the MACsec netdevice is deleted, all related Rx/Tx HW/SW states should be released/deallocated. However, part of the Rx security channel association data is currently not cleaned up properly, hence the memory leaks. Fix by making sure all related Rx SC resources are cleaned/freed; while at it, improve the code by grouping the SC context release in a function so it can be used in both the delete MACsec device and delete Rx SC operations. Fixes: 5a39816a75e5 ("net/mlx5e: Add MACsec offload SecY support") Signed-off-by: Raed Salem <raeds@nvidia.com> Reviewed-by: Emeel Hakim <ehakim@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-11-24 net/mlx5e: MACsec, fix RX data path 16 RX security channel limit [Raed Salem]
Currently the data path metadata flow id mask wrongly limits the number of different RX security channels (SC) to 16, whereas when adding an RX SC the limit is "2^16 - 1". This causes an overlap in metadata flow id once more than 16 RX SCs are added, which corrupts MACsec RX offloaded flow handling. Fix by using the correct mask; while at it, improve the code to use this mask when adding the Rx rule and improve visibility of such errors by adding a debug message. Fixes: b7c9400cbc48 ("net/mlx5e: Implement MACsec Rx data path using MACsec skb_metadata_dst") Signed-off-by: Raed Salem <raeds@nvidia.com> Reviewed-by: Emeel Hakim <ehakim@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
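The overlap described above can be illustrated with a small userspace sketch. The mask values and function names below are hypothetical (the commit message only states the 16 versus "2^16 - 1" limits); this is not the driver's code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical masks: 4 bits (16 ids) versus 16 bits (2^16 ids). */
#define FS_ID_MASK_NARROW 0x000Fu
#define FS_ID_MASK_WIDE   0xFFFFu

/* Extract the flow id that the data path would see in the metadata word. */
uint32_t flow_id_from_metadata(uint32_t metadata, uint32_t mask)
{
	return metadata & mask;
}

/* Return 1 when two distinct flow ids become indistinguishable under mask. */
int flow_ids_collide(uint32_t id_a, uint32_t id_b, uint32_t mask)
{
	assert(id_a != id_b); /* precondition: caller passes distinct ids */
	return flow_id_from_metadata(id_a, mask) ==
	       flow_id_from_metadata(id_b, mask);
}
```

With the narrow mask, ids 1 and 17 map to the same value, so the 17th channel would alias the first; with the wide mask they stay distinct.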
2022-11-24 net/mlx5e: Use kvfree() in mlx5e_accel_fs_tcp_create() [YueHaibing]
'accel_tcp' is allocated by kvzalloc(), so it should be freed by kvfree(). Fixes: f52f2faee581 ("net/mlx5e: Introduce flow steering API") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-11-24 net/mlx5e: Fix a couple error codes [Dan Carpenter]
If kvzalloc() fails then return -ENOMEM. Don't return success. Fixes: 3b20949cb21b ("net/mlx5e: Add MACsec RX steering rules") Fixes: e467b283ffd5 ("net/mlx5e: Add MACsec TX steering rules") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
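A minimal userspace sketch of the pattern this fix restores: an error variable that starts at 0 must be set before jumping to the common exit path. The function names and the allocator hook are hypothetical, introduced only so the failure path can be exercised; calloc() stands in for kvzalloc().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocator hook so the failure path can be exercised. */
typedef void *(*alloc_fn)(size_t size);

/* Allocator that always fails, standing in for kvzalloc() returning NULL. */
void *failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}

/* Allocator that succeeds and returns zeroed memory. */
void *zeroing_alloc(size_t size)
{
	return calloc(1, size);
}

/* Returns 0 on success or a negative errno on failure. */
int create_rule(alloc_fn alloc)
{
	int err = 0;
	void *spec;

	assert(alloc != NULL);

	spec = alloc(64);
	if (!spec) {
		/* Without this assignment the function would report success. */
		err = -ENOMEM;
		goto out;
	}

	/* ... rule setup would happen here ... */
	free(spec);
out:
	return err;
}
```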
2022-11-24 net/mlx5e: Fix use-after-free when reverting termination table [Roi Dayan]
When there are multiple dests with termination tables and the second one or a later one fails, the driver reverts usage of term tables but doesn't reset the assignment in attr->dests[num_vport_dests].termtbl, which causes a use-after-free when releasing the rule. Fix by resetting the assignment of termtbl to null. Fixes: 10caabdaad5a ("net/mlx5e: Use termination table for VLAN push actions") Signed-off-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Maor Dickman <maord@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
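The revert-and-clear idea can be sketched in plain C. The structures and the `fail_at` parameter are hypothetical simplifications, not the mlx5 data structures; the point is only that the revert loop must clear each pointer it releases so later cleanup cannot follow it.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DESTS 4

struct termtbl { int refcount; };
struct dest { struct termtbl *termtbl; };

/*
 * Attach one table per destination. fail_at simulates a failure at that
 * index (pass -1 for no failure). On failure, every table attached so far
 * is released and its pointer is cleared.
 */
int setup_termtbls(struct dest *dests, struct termtbl *tbls, int n, int fail_at)
{
	int i;

	assert(n <= MAX_DESTS);

	for (i = 0; i < n; i++) {
		if (i == fail_at)
			goto revert;
		tbls[i].refcount++;
		dests[i].termtbl = &tbls[i];
	}
	return 0;

revert:
	while (--i >= 0) {
		dests[i].termtbl->refcount--; /* release the table */
		dests[i].termtbl = NULL;      /* the fix: drop the stale pointer */
	}
	return -1;
}
```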
2022-11-24 net/mlx5: Fix uninitialized variable bug in outlen_write() [YueHaibing]
If sscanf() returns 0, outlen is uninitialized and used in kzalloc(), which is unexpected. We should return -EINVAL if the string is invalid. Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
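As a rough userspace illustration of the check described above, the sketch below treats anything other than exactly one converted item as invalid input. The function name `parse_outlen` is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * Parse a decimal length from buf into *outlen.
 * Returns 0 on success and -EINVAL when no integer could be converted.
 */
int parse_outlen(const char *buf, int *outlen)
{
	int value;

	assert(buf != NULL && outlen != NULL);

	/*
	 * sscanf() returns the number of items converted. A result of 0 (or
	 * EOF for empty input) means 'value' was never written, so it must
	 * not be used.
	 */
	if (sscanf(buf, "%d", &value) != 1)
		return -EINVAL;

	*outlen = value;
	return 0;
}
```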
2022-11-24 net/mlx5: E-switch, Fix duplicate lag creation [Chris Mi]
If a bond is created first and sriov is then enabled in switchdev mode, the following syndrome is hit: mlx5_core 0000:08:00.0: mlx5_cmd_out_err:778:(pid 25543): CREATE_LAG(0x840) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x7d49cb), err(-22) The reason is that the offending patch removes eswitch mode none. In vf lag, the check for eswitch mode none is replaced by a check of whether sriov is enabled. But when the driver enables sriov, it triggers the bond workqueue task first and only then sets the sriov number in pci_enable_sriov(). So the check fails. Fix it by checking whether sriov is enabled using the eswitch internal counter that is set before triggering the bond workqueue task. Fixes: f019679ea5f2 ("net/mlx5: E-switch, Remove dependency between sriov and eswitch mode") Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-11-24 net/mlx5: E-switch, Destroy legacy fdb table when needed [Chris Mi]
The cited commit removes eswitch mode none. But when disabling sriov in legacy mode or changing from switchdev to legacy mode without sriov enabled, the legacy fdb table is not destroyed. This is not the right behavior. Destroy the legacy fdb table in the above two cases. Fixes: f019679ea5f2 ("net/mlx5: E-switch, Remove dependency between sriov and eswitch mode") Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Eli Cohen <elic@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-11-24 net/mlx5: DR, Fix uninitialized var warning [YueHaibing]
Smatch warns: drivers/net/ethernet/mellanox/mlx5/core/steering/dr_table.c:81 mlx5dr_table_set_miss_action() error: uninitialized symbol 'ret'. Initialize ret with -EOPNOTSUPP and fix the missing action case. Fixes: 7838e1725394 ("net/mlx5: DR, Expose steering table functionality") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-11-24 net: wwan: t7xx: Fix the ACPI memory leak [Hanjun Guo]
The ACPI buffer memory (buffer.pointer) should be freed as the buffer is not used after acpi_evaluate_object(), free it to prevent memory leak. Fixes: 13e920d93e37 ("net: wwan: t7xx: Add core components") Signed-off-by: Hanjun Guo <guohanjun@huawei.com> Link: https://lore.kernel.org/r/1669119580-28977-1-git-send-email-guohanjun@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-11-24 octeontx2-pf: Add check for devm_kcalloc [Jiasheng Jiang]
As devm_kcalloc() may return a NULL pointer, add a check for the return value, as is done for the other allocations. Fixes: e8e095b3b370 ("octeontx2-af: cn10k: Bandwidth profiles config support") Signed-off-by: Jiasheng Jiang <jiasheng@iscas.ac.cn> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20221122055449.31247-1-jiasheng@iscas.ac.cn Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-11-23 net: enetc: preserve TX ring priority across reconfiguration [Vladimir Oltean]
In the blamed commit, a rudimentary reallocation procedure for RX buffer descriptors was implemented, for the situation when their format changes between normal (no PTP) and extended (PTP). enetc_hwtstamp_set() calls enetc_close() and enetc_open() in a sequence, and this sequence loses information which was previously configured in the TX BDR Mode Register, specifically via the enetc_set_bdr_prio() call. The TX ring priority is configured by tc-mqprio and tc-taprio, and affects important things for TSN such as the TX time of packets. The issue manifests itself most visibly by the fact that isochron --txtime reports premature packet transmissions when PTP is first enabled on an enetc interface. Save the TX ring priority in a new field in struct enetc_bdr (occupies a 2 byte hole on arm64) in order to make this survive a ring reconfiguration. Fixes: 434cebabd3a2 ("enetc: Add dynamic allocation of extended Rx BD rings") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/r/20221122130936.1704151-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-23 net: marvell: prestera: add missing unregister_netdev() in prestera_port_create() [Zhang Changzhong]
If prestera_port_sfp_bind() fails, unregister_netdev() should be called in error handling path. Compile tested only. Fixes: 52323ef75414 ("net: marvell: prestera: add phylink support") Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/1669115432-36841-1-git-send-email-zhangchangzhong@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-23 Merge branch 'nfc-st-nci-restructure-validating-logic-in-evt_transaction' [Jakub Kicinski]
Martin Faltesek says: ==================== nfc: st-nci: Restructure validating logic in EVT_TRANSACTION These are the same 3 patches that were applied in st21nfca here: https://lore.kernel.org/netdev/20220607025729.1673212-1-mfaltesek@google.com with a couple minor differences. st-nci has nearly identical code to that of st21nfca for EVT_TRANSACTION, except that there are two extra validation checks that are not present in the st-nci code. The 3/3 patch as coded for st21nfca pulls those checks in, bringing both drivers into parity. ==================== Link: https://lore.kernel.org/r/20221122004246.4186422-1-mfaltesek@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-23 nfc: st-nci: fix incorrect sizing calculations in EVT_TRANSACTION [Martin Faltesek]
The transaction buffer is allocated by using the size of the packet buf, and subtracting two which seems intended to remove the two tags which are not present in the target structure. This calculation leads to under counting memory because of differences between the packet contents and the target structure. The aid_len field is a u8 in the packet, but a u32 in the structure, resulting in at least 3 bytes always being under counted. Further, the aid data is a variable length field in the packet, but fixed in the structure, so if this field is less than the max, the difference is added to the under counting. To fix, perform validation checks progressively to safely reach the next field, to determine the size of both buffers and verify both tags. Once all validation checks pass, allocate the buffer and copy the data. This eliminates freeing memory on the error path, as validation checks are moved ahead of memory allocation. Reported-by: Denis Efremov <denis.e.efremov@oracle.com> Reviewed-by: Guenter Roeck <groeck@google.com> Fixes: 5d1ceb7f5e56 ("NFC: st21nfcb: Add HCI transaction event support") Signed-off-by: Martin Faltesek <mfaltesek@google.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
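The "validate progressively, then allocate" approach can be sketched as follows. This is a simplified userspace illustration, not the st-nci code: the packet layout (tag, length, data, tag, length, data), the tag values, `AID_MAX_LEN`, and all names are assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define AID_TAG     0x81 /* assumed tag value */
#define PARAMS_TAG  0x82 /* assumed tag value */
#define AID_MAX_LEN 16   /* assumed maximum AID length */

/* Target structure: aid_len is wider and aid is fixed-size, unlike the packet. */
struct transaction {
	uint32_t aid_len;
	uint8_t  aid[AID_MAX_LEN];
	uint8_t  params_len;
	uint8_t  params[];
};

/* Returns 0 and sets *out on success, or a negative errno on failure. */
int parse_transaction(const uint8_t *pkt, size_t len, struct transaction **out)
{
	const uint8_t *params_hdr;
	struct transaction *t;
	uint8_t aid_len, params_len;

	assert(pkt != NULL && out != NULL);

	/* Step 1: the AID tag and AID length bytes must be present. */
	if (len < 2 || pkt[0] != AID_TAG)
		return -EPROTO;
	aid_len = pkt[1];
	if (aid_len > AID_MAX_LEN)
		return -EPROTO;

	/* Step 2: the whole AID plus the parameters tag and length. */
	if (len < (size_t)2 + aid_len + 2)
		return -EPROTO;
	params_hdr = pkt + 2 + aid_len;
	if (params_hdr[0] != PARAMS_TAG)
		return -EPROTO;
	params_len = params_hdr[1];

	/* Step 3: the whole parameter block. */
	if (len < (size_t)2 + aid_len + 2 + params_len)
		return -EPROTO;

	/* Only now allocate, sized from the target structure, not the packet. */
	t = calloc(1, sizeof(*t) + params_len);
	if (!t)
		return -ENOMEM;

	t->aid_len = aid_len;
	memcpy(t->aid, pkt + 2, aid_len);
	t->params_len = params_len;
	memcpy(t->params, params_hdr + 2, params_len);
	*out = t;
	return 0;
}
```

Because every check runs before the allocation, none of the rejection paths has anything to free.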
2022-11-23 nfc: st-nci: fix memory leaks in EVT_TRANSACTION [Martin Faltesek]
Error path does not free previously allocated memory. Add devm_kfree() to the failure path. Reported-by: Denis Efremov <denis.e.efremov@oracle.com> Reviewed-by: Guenter Roeck <groeck@google.com> Fixes: 5d1ceb7f5e56 ("NFC: st21nfcb: Add HCI transaction event support") Signed-off-by: Martin Faltesek <mfaltesek@google.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-23 nfc: st-nci: fix incorrect validating logic in EVT_TRANSACTION [Martin Faltesek]
The first validation check for EVT_TRANSACTION has two different checks tied together with logical AND. One is a check for minimum packet length, and the other is for a valid aid_tag. If either condition is true (fails), then an error should be triggered. The fix is to change && to ||. Reported-by: Denis Efremov <denis.e.efremov@oracle.com> Reviewed-by: Guenter Roeck <groeck@google.com> Fixes: 5d1ceb7f5e56 ("NFC: st21nfcb: Add HCI transaction event support") Signed-off-by: Martin Faltesek <mfaltesek@google.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
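The difference between the two operators is easy to see in isolation. In this sketch the tag value, minimum length, and function names are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define AID_TAG 0x81 /* assumed tag value */
#define MIN_LEN 2    /* assumed minimum packet length */

/* Buggy form: rejects only when the packet is short AND the tag is wrong. */
int rejects_buggy(const uint8_t *pkt, size_t len)
{
	assert(pkt != NULL);
	return len < MIN_LEN && pkt[0] != AID_TAG;
}

/* Fixed form: rejects when the packet is short OR the tag is wrong. */
int rejects_fixed(const uint8_t *pkt, size_t len)
{
	assert(pkt != NULL);
	return len < MIN_LEN || pkt[0] != AID_TAG;
}
```

A one-byte packet with a valid tag, or a long enough packet with a wrong tag, slips past the `&&` form but is rejected by the `||` form.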
2022-11-23 ublk_drv: don't forward io commands in reserve order [Ming Lei]
Whether or not ublk_can_use_task_work() is true, io commands are forwarded to the ublk server in reverse order, since llist_add() always adds an element to the head of the list. Even though the block layer doesn't guarantee request dispatch order, requests should be sent to hardware in the sequence generated by the io scheduler, which usually considers the request's LBA, and order is often important for HDDs. So forward io commands in the sequence made by the io scheduler by aligning task work with the current io_uring command's batch handling; it has been observed that both can get similar performance data if IORING_SETUP_COOP_TASKRUN is set by the ublk server. Reported-by: Andreas Hindborg <andreas.hindborg@wdc.com> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com> Link: https://lore.kernel.org/r/20221121155645.396272-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
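Why head insertion reverses order, and one generic way to restore it, can be shown with a plain singly linked list. This is a userspace sketch with hypothetical names; it does not claim to mirror how the ublk patch itself restores the order.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	int tag;
	struct node *next;
};

/* Push at the head, as llist_add() does: the last item pushed comes first. */
void push_head(struct node **head, struct node *n)
{
	assert(head != NULL && n != NULL);
	n->next = *head;
	*head = n;
}

/* Reverse the list in place and return the new head (submission order). */
struct node *reverse_list(struct node *head)
{
	struct node *prev = NULL;

	while (head) {
		struct node *next = head->next;

		head->next = prev;
		prev = head;
		head = next;
	}
	return prev;
}
```

Pushing tags 1, 2, 3 yields 3, 2, 1 when walking from the head; after `reverse_list()` the walk yields 1, 2, 3 again.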
2022-11-23 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec [Jakub Kicinski]
Steffen Klassert says: ==================== ipsec 2022-11-23 1) Fix "disable_policy" on ipv4 early demux. Packets after the initial packet in a flow might be incorrectly dropped on early demux if there are no matching policies. From Eyal Birger. 2) Fix a kernel warning in case XFRM encap type is not available. From Eyal Birger. 3) Fix ESN wrap around for GSO to avoid a double usage of a sequence number. From Christian Langrock. 4) Fix a send_acquire race with pfkey_register. From Herbert Xu. 5) Fix a list corruption panic in __xfrm_state_delete(). From Thomas Jarosch. 6) Fix an unchecked return value in xfrm6_init(). From Chen Zhongjin. * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: xfrm: Fix ignored return value in xfrm6_init() xfrm: Fix oops in __xfrm_state_delete() af_key: Fix send_acquire race with pfkey_register xfrm: replay: Fix ESN wrap around for GSO xfrm: lwtunnel: squelch kernel warning in case XFRM encap type is not available xfrm: fix "disable_policy" on ipv4 early demux ==================== Link: https://lore.kernel.org/r/20221123093117.434274-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-23 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf [Jakub Kicinski]
Pablo Neira Ayuso says: ==================== Netfilter fixes for net 1) Fix regression in ipset hash:ip with IPv4 range, from Vishwanath Pai. This is fixing up a bug introduced in the 6.0 release. 2) The "netfilter: ipset: enforce documented limit to prevent allocating huge memory" patch contained a wrong condition which makes impossible to add up to 64 clashing elements to a hash:net,iface type of set while it is the documented feature of the set type. The patch fixes the condition and thus makes possible to add the elements while keeps preventing allocating huge memory, from Jozsef Kadlecsik. This has been broken for several releases. 3) Missing locking when updating the flow block list which might lead a reader to crash. This has been broken since the introduction of the flowtable hardware offload support. * git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: flowtable_offload: add missing locking netfilter: ipset: restore allowing 64 clashing elements in hash:net,iface netfilter: ipset: regression in ip_set_hash_ip.c ==================== Link: https://lore.kernel.org/r/20221122212814.63177-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-23 Documentation: networking: Update generic_netlink_howto URL [Nir Levy]
The documentation refers to an invalid web page under www.linuxfoundation.org. The patch refers to a working URL under wiki.linuxfoundation.org. Signed-off-by: Nir Levy <bhr166@gmail.com> Link: https://lore.kernel.org/all/20221120220630.7443-1-bhr166@gmail.com/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-24 scripts: add rust in scripts/Makefile.package [Paran Lee]
Add the rust argument to TAR_CONTENT in the scripts/Makefile.package script, in alphabetical order. Signed-off-by: Paran Lee <p4ranlee@gmail.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2022-11-24 kbuild: fix "cat: .version: No such file or directory" [Masahiro Yamada]
Since commit 2df8220cc511 ("kbuild: build init/built-in.a just once"), the .version file is not touched at all when KBUILD_BUILD_VERSION is given. If KBUILD_BUILD_VERSION is specified and the .version file is missing (for example right after 'make mrproper'), "No such file or directory" is shown. Even if the .version exists, it is irrelevant to the version of the current build. $ make -j$(nproc) KBUILD_BUILD_VERSION=100 mrproper defconfig all [ snip ] BUILD arch/x86/boot/bzImage cat: .version: No such file or directory Kernel: arch/x86/boot/bzImage is ready (#) Show KBUILD_BUILD_VERSION if it is given. Fixes: 2df8220cc511 ("kbuild: build init/built-in.a just once") Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
2022-11-23 tracing: Free buffers when a used dynamic event is removed [Steven Rostedt (Google)]
After 65536 dynamic events have been added and removed, the "type" field of the event then uses the first type number that is available (not currently used by other events). A type number is the identifier of the binary blobs in the tracing ring buffer (known as events) to map them to logic that can parse the binary blob. The issue is that if a dynamic event (like a kprobe event) is traced and is in the ring buffer, and then that event is removed (because it is dynamic, which means it can be created and destroyed), and another dynamic event is created that has the same number, that new event's logic for parsing the binary blob will be used. To show how this can be an issue, the following can crash the kernel: # cd /sys/kernel/tracing # for i in `seq 65536`; do echo 'p:kprobes/foo do_sys_openat2 $arg1:u32' > kprobe_events # done For every iteration of the above, the write to kprobe_events will remove the old event and create a new one (with the same format) and increase the type number to the next available one until the type number exceeds 65535, which is the maximum for the 16-bit type. After it reaches that number, the logic to allocate a new number simply looks for the next available number. When a dynamic event is removed, that number is then available to be reused by the next dynamic event created. That is, once the above reaches the max number, the number assigned to the event in that loop will remain the same. That means deleting one dynamic event and creating another will reuse the previous event's type number. This is where bad things can happen. After the above loop finishes, the kprobes/foo event reads the do_sys_openat2 function call's first parameter as an integer. # echo 1 > kprobes/foo/enable # cat /etc/passwd > /dev/null # cat trace cat-2211 [005] .... 2007.849603: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196 cat-2211 [005] .... 2007.849620: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196 cat-2211 [005] .... 
2007.849838: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196 cat-2211 [005] .... 2007.849880: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196 # echo 0 > kprobes/foo/enable Now if we delete the kprobe and create a new one that reads a string: # echo 'p:kprobes/foo do_sys_openat2 +0($arg2):string' > kprobe_events And now we can the trace: # cat trace sendmail-1942 [002] ..... 530.136320: foo: (do_sys_openat2+0x0/0x240) arg1= cat-2046 [004] ..... 530.930817: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������" cat-2046 [004] ..... 530.930961: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������" cat-2046 [004] ..... 530.934278: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������" cat-2046 [004] ..... 530.934563: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������" bash-1515 [007] ..... 
534.299093: foo: (do_sys_openat2+0x0/0x240) arg1="kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk���������@��4Z����;Y�����U And dmesg has: ================================================================== BUG: KASAN: use-after-free in string+0xd4/0x1c0 Read of size 1 at addr ffff88805fdbbfa0 by task cat/2049 CPU: 0 PID: 2049 Comm: cat Not tainted 6.1.0-rc6-test+ #641 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016 Call Trace: <TASK> dump_stack_lvl+0x5b/0x77 print_report+0x17f/0x47b kasan_report+0xad/0x130 string+0xd4/0x1c0 vsnprintf+0x500/0x840 seq_buf_vprintf+0x62/0xc0 trace_seq_printf+0x10e/0x1e0 print_type_string+0x90/0xa0 print_kprobe_event+0x16b/0x290 print_trace_line+0x451/0x8e0 s_show+0x72/0x1f0 seq_read_iter+0x58e/0x750 seq_read+0x115/0x160 vfs_read+0x11d/0x460 ksys_read+0xa9/0x130 do_syscall_64+0x3a/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7fc2e972ade2 Code: c0 e9 b2 fe ff ff 50 48 8d 3d b2 3f 0a 00 e8 05 f0 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24 RSP: 002b:00007ffc64e687c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fc2e972ade2 RDX: 0000000000020000 RSI: 00007fc2e980d000 RDI: 0000000000000003 RBP: 00007fc2e980d000 R08: 00007fc2e980c010 R09: 0000000000000000 R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000020f00 R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000 </TASK> The buggy address belongs to the physical page: page:ffffea00017f6ec0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x5fdbb flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff) raw: 000fffffc0000000 0000000000000000 ffffea00017f6ec8 0000000000000000 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page 
dumped because: kasan: bad access detected Memory state around the buggy address: ffff88805fdbbe80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff88805fdbbf00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >ffff88805fdbbf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ^ ffff88805fdbc000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff88805fdbc080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ================================================================== This was found when Zheng Yejian sent a patch to convert the event type number assignment to use IDA, which gives the next available number, and this bug showed up in the fuzz testing by Yujie Liu and the kernel test robot. But after further analysis, I found that this behavior is the same as when the event type numbers go past the 16-bit max (and the above shows that). Modules have a similar issue, which is dealt with by setting a "WAS_ENABLED" flag when a module event is enabled; when the module is freed, if any of its events were enabled, the ring buffer that holds that event is also cleared, to prevent reading stale events. The same can be done for dynamic events. If any dynamic event that is being removed was enabled, then make sure the buffers it was enabled in are now cleared. 
Link: https://lkml.kernel.org/r/20221123171434.545706e3@gandalf.local.home Link: https://lore.kernel.org/all/20221110020319.1259291-1-zhengyejian1@huawei.com/ Cc: stable@vger.kernel.org Cc: Andrew Morton <akpm@linux-foundation.org> Depends-on: e18eb8783ec49 ("tracing: Add tracing_reset_all_online_cpus_unlocked() function") Depends-on: 5448d44c38557 ("tracing: Add unified dynamic event framework") Depends-on: 6212dd29683ee ("tracing/kprobes: Use dyn_event framework for kprobe events") Depends-on: 065e63f951432 ("tracing: Only have rmmod clear buffers that its events were active in") Depends-on: 575380da8b469 ("tracing: Only clear trace buffer on module unload if event was traced") Fixes: 77b44d1b7c283 ("tracing/kprobes: Rename Kprobe-tracer to kprobe-event") Reported-by: Zheng Yejian <zhengyejian1@huawei.com> Reported-by: Yujie Liu <yujie.liu@intel.com> Reported-by: kernel test robot <yujie.liu@intel.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
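The hazard and the remedy described in this commit can be modelled with a toy ring buffer. Everything below is a hypothetical userspace simplification (the structures, names, and the single shared buffer are assumptions), intended only to show why clearing the buffer on removal prevents a reused type number from matching stale records.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8

/* A recorded event: the type number selects the parser used to decode it. */
struct record {
	unsigned short type;
	long payload;
};

struct ring {
	struct record recs[RING_SIZE];
	size_t count;
};

struct dyn_event {
	unsigned short type;
	bool was_enabled; /* set once the event has been enabled */
};

void dyn_event_enable(struct dyn_event *ev)
{
	ev->was_enabled = true;
}

void ring_write(struct ring *rb, const struct dyn_event *ev, long payload)
{
	assert(rb->count < RING_SIZE);
	rb->recs[rb->count].type = ev->type;
	rb->recs[rb->count].payload = payload;
	rb->count++;
}

/* Number of records a parser registered for 'type' would be asked to decode. */
size_t ring_count_type(const struct ring *rb, unsigned short type)
{
	size_t i, n = 0;

	for (i = 0; i < rb->count; i++)
		if (rb->recs[i].type == type)
			n++;
	return n;
}

/*
 * Remove a dynamic event. If it was ever enabled, clear the buffer so that
 * no stale record carries a type number that is about to be reused.
 */
void dyn_event_remove(struct ring *rb, struct dyn_event *ev)
{
	if (ev->was_enabled)
		rb->count = 0;
	ev->type = 0; /* the number is now free for the next dynamic event */
}
```

Without the clearing step in `dyn_event_remove()`, a new event assigned the same type number would be handed the old event's records to parse.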
2022-11-23 tracing: Add tracing_reset_all_online_cpus_unlocked() function [Steven Rostedt (Google)]
Currently, tracing_reset_all_online_cpus() requires trace_types_lock to be held. But only one caller of this function actually has that lock held before calling it, and the other just takes the lock so that it can call it. More users of this function are needed where the lock is not held. Add a tracing_reset_all_online_cpus_unlocked() function for the one caller that already holds the lock, and add a lockdep_assert to it to make sure the lock is held when it is called. Then have tracing_reset_all_online_cpus() take the lock internally, so that callers do not need to worry about taking it. Link: https://lkml.kernel.org/r/20221123192741.658273220@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-11-23 tracing: Fix race where histograms can be called before the event [Steven Rostedt (Google)]
commit 94eedf3dded5 ("tracing: Fix race where eprobes can be called before the event") fixed an issue where if an event is soft disabled, and the trigger is being added, there's a small window where the event sees that there's a trigger but does not see that it requires reading the event yet, and then calls the trigger with the record == NULL. This could be solved with adding memory barriers in the hot path, or to make sure that all the triggers requiring a record check for NULL. The latter was chosen. Commit 94eedf3dded5 set the eprobe trigger handle to check for NULL, but the same needs to be done with histograms. Link: https://lore.kernel.org/linux-trace-kernel/20221118211809.701d40c0f8a757b0df3c025a@kernel.org/ Link: https://lore.kernel.org/linux-trace-kernel/20221123164323.03450c3a@gandalf.local.home Cc: Tom Zanussi <zanussi@kernel.org> Cc: stable@vger.kernel.org Fixes: 7491e2c442781 ("tracing: Add a probe that attaches to trace events") Reported-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-11-23 Merge branch 'kvm-dwmw2-fixes' into HEAD [Paolo Bonzini]
This brings in a few important fixes for Xen emulation. While nobody should be enabling it, the bug effectively allows userspace to read arbitrary memory. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-23 KVM: Update gfn_to_pfn_cache khva when it moves within the same page [David Woodhouse]
In the case where a GPC is refreshed to a different location within the same page, we didn't bother to update it. Mostly we don't need to, but since the ->khva field also includes the offset within the page, that does have to be updated. Fixes: 3ba2c95ea180 ("KVM: Do not incorporate page offset into gfn=>pfn cache user address") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Sean Christopherson <seanjc@google.com> Cc: stable@kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-23 KVM: x86/xen: Only do in-kernel acceleration of hypercalls for guest CPL0 [David Woodhouse]
There are almost no hypercalls which are valid from CPL > 0, and definitely none which are handled by the kernel. Fixes: 2fd6df2f2b47 ("KVM: x86/xen: intercept EVTCHNOP_send from guests") Reported-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Sean Christopherson <seanjc@google.com> Cc: stable@kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-23 KVM: x86/xen: Validate port number in SCHEDOP_poll [David Woodhouse]
We shouldn't allow guests to poll on arbitrary port numbers off the end of the event channel table. Fixes: 1a65105a5aba ("KVM: x86/xen: handle PV spinlocks slowpath") [dwmw2: my bug though; the original version did check the validity as a side-effect of an idr_find() which I ripped out in refactoring.] Reported-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Sean Christopherson <seanjc@google.com> Cc: stable@kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-23 KVM: x86/mmu: Fix race condition in direct_page_fault [Kazuki Takiguchi]
make_mmu_pages_available() must be called with mmu_lock held for write. However, if the TDP MMU is used, it will be called with mmu_lock held for read. This function does nothing unless shadow pages are used, so there is no race unless nested TDP is used. Since nested TDP uses shadow pages, old shadow pages may be zapped by this function even when the TDP MMU is enabled. Since shadow pages are never allocated by kvm_tdp_mmu_map(), a race condition can be avoided by not calling make_mmu_pages_available() if the TDP MMU is currently in use. I encountered this when repeatedly starting and stopping nested VM. It can be artificially caused by allocating a large number of nested TDP SPTEs. For example, the following BUG and general protection fault are caused in the host kernel. pte_list_remove: 00000000cd54fc10 many->many ------------[ cut here ]------------ kernel BUG at arch/x86/kvm/mmu/mmu.c:963! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI RIP: 0010:pte_list_remove.cold+0x16/0x48 [kvm] Call Trace: <TASK> drop_spte+0xe0/0x180 [kvm] mmu_page_zap_pte+0x4f/0x140 [kvm] __kvm_mmu_prepare_zap_page+0x62/0x3e0 [kvm] kvm_mmu_zap_oldest_mmu_pages+0x7d/0xf0 [kvm] direct_page_fault+0x3cb/0x9b0 [kvm] kvm_tdp_page_fault+0x2c/0xa0 [kvm] kvm_mmu_page_fault+0x207/0x930 [kvm] npf_interception+0x47/0xb0 [kvm_amd] svm_invoke_exit_handler+0x13c/0x1a0 [kvm_amd] svm_handle_exit+0xfc/0x2c0 [kvm_amd] kvm_arch_vcpu_ioctl_run+0xa79/0x1780 [kvm] kvm_vcpu_ioctl+0x29b/0x6f0 [kvm] __x64_sys_ioctl+0x95/0xd0 do_syscall_64+0x5c/0x90 general protection fault, probably for non-canonical address 0xdead000000000122: 0000 [#1] PREEMPT SMP NOPTI RIP: 0010:kvm_mmu_commit_zap_page.part.0+0x4b/0xe0 [kvm] Call Trace: <TASK> kvm_mmu_zap_oldest_mmu_pages+0xae/0xf0 [kvm] direct_page_fault+0x3cb/0x9b0 [kvm] kvm_tdp_page_fault+0x2c/0xa0 [kvm] kvm_mmu_page_fault+0x207/0x930 [kvm] npf_interception+0x47/0xb0 [kvm_amd] CVE: CVE-2022-45869 Fixes: a2855afc7ee8 ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU") 
Signed-off-by: Kazuki Takiguchi <takiguchi.kazuki171@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-23 Merge tag 'pci-v6.1-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci [Linus Torvalds]
Pull pci fixes from Bjorn Helgaas: - Update MAINTAINERS to add Manivannan Sadhasivam as Qcom PCIe RC maintainer (replacing Stanimir Varbanov) and include DT PCI bindings in the "PCI native host bridge and endpoint drivers" entry. * tag 'pci-v6.1-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: MAINTAINERS: Include PCI bindings in host bridge entry MAINTAINERS: Add Manivannan Sadhasivam as Qcom PCIe RC maintainer
2022-11-23 NFSD: Fix reads with a non-zero offset that don't end on a page boundary [Chuck Lever]
This was found when virtual machines with nfs-mounted qcow2 disks failed to boot properly. Reported-by: Anders Blomdell <anders.blomdell@control.lth.se> Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2142132 Fixes: bfbfb6182ad1 ("nfsd_splice_actor(): handle compound pages") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-11-23 Merge tag 'spi-fix-v6.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi [Linus Torvalds]
Pull spi fixes from Mark Brown: "A few fixes, all device specific. The most important ones are for the i.MX driver which had a couple of nasty data corruption inducing errors appear after the change to support PIO mode in the last merge window (one introduced by the change and one latent one which the PIO changes exposed). Thanks to Frieder, Fabio, Marc and Marek for jumping on that and resolving the issues quickly once they were found" * tag 'spi-fix-v6.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: spi: spi-imx: spi_imx_transfer_one(): check for DMA transfer first spi: tegra210-quad: Fix duplicate resource error spi: dw-dma: decrease reference count in dw_spi_dma_init_mfld() spi: spi-imx: Fix spi_bus_clk if requested clock is higher than input clock spi: mediatek: Fix DEVAPC Violation at KO Remove
2022-11-23Merge tag '9p-for-6.1-rc7' of https://github.com/martinetd/linuxLinus Torvalds
Pull 9p fixes from Dominique Martinet: - 9p now uses a variable size for its recv buffer, but every place hadn't been updated properly to use it and some buffer overflows have been found and needed fixing. There's still one place where msize is incorrectly used in a safety check (p9_check_errors), but all paths leading to it should already be avoiding overflows and that patch took a bit more time to get right for zero-copy requests so I'll send it for 6.2 - yet another race condition in p9_conn_cancel introduced by a fix for a syzbot report in the same place. Maybe at some point we'll get it right without burning it all down... * tag '9p-for-6.1-rc7' of https://github.com/martinetd/linux: 9p/xen: check logical size for buffer size 9p/fd: Use P9_HDRSZ for header size 9p/fd: Fix write overflow in p9_read_work 9p/fd: fix issue of list_del corruption in p9_fd_cancel()
2022-11-23lib/vdso: use "grep -E" instead of "egrep"Greg Kroah-Hartman
The latest version of grep declares egrep obsolescent, so the build now emits warnings that look like: "egrep: warning: egrep is obsolescent; using grep -E". Fix this up by switching the vdso Makefile to use "grep -E" instead. Cc: Andy Lutomirski <luto@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Link: https://lore.kernel.org/r/20220920170633.3133829-1-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-23fscache: fix OOB Read in __fscache_acquire_volumeDavid Howells
The type of a->key[0] is char in fscache_volume_same(). If the length of cache volume key is greater than 127, the value of a->key[0] is less than 0. In this case, klen becomes much larger than 255 after type conversion, because the type of klen is size_t. As a result, memcmp() is read out of bounds. This causes a slab-out-of-bounds Read in __fscache_acquire_volume(), as reported by Syzbot. Fix this by changing the type of the stored key to "u8 *" rather than "char *" (it isn't a simple string anyway). Also put in a check that the volume name doesn't exceed NAME_MAX. BUG: KASAN: slab-out-of-bounds in memcmp+0x16f/0x1c0 lib/string.c:757 Read of size 8 at addr ffff888016f3aa90 by task syz-executor344/3613 Call Trace: memcmp+0x16f/0x1c0 lib/string.c:757 memcmp include/linux/fortify-string.h:420 [inline] fscache_volume_same fs/fscache/volume.c:133 [inline] fscache_hash_volume fs/fscache/volume.c:171 [inline] __fscache_acquire_volume+0x76c/0x1080 fs/fscache/volume.c:328 fscache_acquire_volume include/linux/fscache.h:204 [inline] v9fs_cache_session_get_cookie+0x143/0x240 fs/9p/cache.c:34 v9fs_session_init+0x1166/0x1810 fs/9p/v9fs.c:473 v9fs_mount+0xba/0xc90 fs/9p/vfs_super.c:126 legacy_get_tree+0x105/0x220 fs/fs_context.c:610 vfs_get_tree+0x89/0x2f0 fs/super.c:1530 do_new_mount fs/namespace.c:3040 [inline] path_mount+0x1326/0x1e20 fs/namespace.c:3370 do_mount fs/namespace.c:3383 [inline] __do_sys_mount fs/namespace.c:3591 [inline] __se_sys_mount fs/namespace.c:3568 [inline] __x64_sys_mount+0x27f/0x300 fs/namespace.c:3568 Fixes: 62ab63352350 ("fscache: Implement volume registration") Reported-by: syzbot+a76f6a6e524cf2080aa3@syzkaller.appspotmail.com Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Zhang Peng <zhangpeng362@huawei.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> cc: Dominique Martinet <asmadeus@codewreck.org> cc: Jeff Layton <jlayton@kernel.org> cc: v9fs-developer@lists.sourceforge.net cc: linux-cachefs@redhat.com Link: 
https://lore.kernel.org/r/Y3OH+Dmi0QIOK18n@codewreck.org/ # Zhang Peng's v1 fix Link: https://lore.kernel.org/r/20221115140447.2971680-1-zhangpeng362@huawei.com/ # Zhang Peng's v2 fix Link: https://lore.kernel.org/r/166869954095.3793579.8500020902371015443.stgit@warthog.procyon.org.uk/ # v1 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
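The sign-extension problem described in the fscache commit above can be reproduced in user space. The sketch below is illustrative only: the helper names are hypothetical and it does not reproduce the kernel's fscache_volume_same(). It uses `signed char` explicitly because the signedness of plain `char` depends on the platform.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helpers: the first byte of the key stores its length.
 * Reading it through a signed char pointer mimics the buggy pattern.
 */
static size_t klen_via_signed_char(const void *key)
{
    const signed char *p = key;

    /* A length above 127 reads as negative and converts to a huge size_t. */
    return (size_t)p[0];
}

/* Reading the same byte as u8 (uint8_t here) keeps it in the range 0..255. */
static size_t klen_via_u8(const void *key)
{
    const uint8_t *p = key;

    return p[0];
}
```

With a length byte of 200, the signed variant yields a value far above 255, which is the kind of length that would then be handed to memcmp(); the unsigned variant yields 200.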
2022-11-23s390/dasd: fix possible buffer overflow in copy_pair_showStefan Haberland
dasd_copy_relation->entry[] array might be accessed out of bounds if the loop does not break. Fixes: a91ff09d39f9 ("s390/dasd: add copy pair setup") Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com> Link: https://lore.kernel.org/r/20221123160719.3002694-5-sth@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
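The pattern behind this kind of bug is a search loop whose index is used after the loop even when nothing matched. The sketch below is a user-space illustration under assumed names (`struct demo_entry`, `find_primary_id` are hypothetical); it is not the DASD driver code and only shows one way to guard the access.

```c
#include <stddef.h>

/* Hypothetical stand-in for one entry of a copy relation. */
struct demo_entry {
    int primary;
    int id;
};

/*
 * Return the id of the first primary entry, or -1 if there is none.
 * The check after the loop prevents reading e[n], one past the end,
 * when the loop ran to completion without hitting the break.
 */
static int find_primary_id(const struct demo_entry *e, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (e[i].primary)
            break;
    }
    if (i == n)
        return -1;

    return e[i].id;
}
```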
2022-11-23s390/dasd: fix no record found for raw_track_accessStefan Haberland
For DASD devices in raw_track_access mode only full track images are read and written. For this purpose it is not necessary to perform a search operation in the locate record extended function. The documentation even states that this might fail if the searched record is not found on a track. Currently the driver sets a value of 1 in the search field for the first record after record zero. This is the default for disks not in raw_track_access mode, but record 1 might be missing on a completely empty track. There has not been any problem with this on IBM storage servers, but it might lead to errors with DASD devices on other vendors' storage servers. Fix this by setting the search field to 0. Record zero is always available even on a completely empty track. Fixes: e4dbb0f2b5dd ("[S390] dasd: Add support for raw ECKD access.") Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com> Link: https://lore.kernel.org/r/20221123160719.3002694-4-sth@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-23s390/dasd: increase printing of debug data payloadStefan Haberland
32 bytes are too few to capture important data from prefix or other commands. Print up to 128 bytes of data, which is enough for the largest CCW data we have. Since printk can only print up to 1024 bytes at once, print the different parts of the CCW dumps separately. Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com> Link: https://lore.kernel.org/r/20221123160719.3002694-3-sth@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-23s390/dasd: Fix spelling mistake "Ivalid" -> "Invalid"Colin Ian King
There is a spelling mistake in a pr_warn message. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Link: https://lore.kernel.org/r/20220923132103.2486724-1-colin.i.king@gmail.com Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Link: https://lore.kernel.org/r/20221123160719.3002694-2-sth@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-23e100: Fix possible use after free in e100_xmit_prepareWang Hai
If e100_xmit_prepare() cannot map the skb, it returns -ENOMEM, so e100_xmit_frame() returns NETDEV_TX_BUSY and the upper layer resends the skb. But the skb has already been freed, which causes a use-after-free (UAF) bug when the upper layer resends it. Remove the harmful free. Fixes: 5e5d49422dfb ("e100: Release skb when DMA mapping is failed in e100_xmit_prepare") Signed-off-by: Wang Hai <wanghai38@huawei.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
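The ownership rule at stake is that a transmit path reporting "busy" must leave the buffer intact, because the caller will submit it again. The following user-space sketch models that rule with hypothetical types and names (`struct fake_skb`, `demo_xmit_prepare`, `demo_xmit_frame`, `TX_OK`, `TX_BUSY`); it is a simplified illustration, not the e100 driver.

```c
#include <errno.h>
#include <stdbool.h>

/* Stand-ins for NETDEV_TX_OK / NETDEV_TX_BUSY. */
enum tx_status { TX_OK = 0, TX_BUSY = 1 };

/* Hypothetical buffer; 'consumed' marks that the driver took ownership. */
struct fake_skb {
    bool consumed;
};

/* Mapping failure is reported, but the buffer is deliberately left alone. */
static int demo_xmit_prepare(struct fake_skb *skb, bool map_ok)
{
    (void)skb;
    if (!map_ok)
        return -ENOMEM;
    return 0;
}

static enum tx_status demo_xmit_frame(struct fake_skb *skb, bool map_ok)
{
    if (demo_xmit_prepare(skb, map_ok) == -ENOMEM)
        return TX_BUSY;     /* caller still owns skb and may resend it */

    skb->consumed = true;   /* driver owns the buffer from here on */
    return TX_OK;
}
```

If the prepare step freed the buffer before returning -ENOMEM, the retry in the caller would touch freed memory, which is the failure mode the commit removes.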
2022-11-23iavf: Fix error handling in iavf_init_module()Yuan Can
iavf_init_module() does not destroy the workqueue when pci_register_driver() fails. Call destroy_workqueue() in that case to prevent the resource leak. This is similar to the handling of u132_hcd_init in commit f276e002793c ("usb: u132-hcd: fix resource leak"). Fixes: 2803b16c10ea ("i40e/i40evf: Use private workqueue") Signed-off-by: Yuan Can <yuancan@huawei.com> Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-11-23fm10k: Fix error handling in fm10k_init_module()Yuan Can
A modprobe of fm10k fails with the following log:

Intel(R) Ethernet Switch Host Interface Driver
Copyright(c) 2013 - 2019 Intel Corporation.
debugfs: Directory 'fm10k' with parent '/' already present!

The reason is that fm10k_init_module() returns the result of fm10k_register_pci_driver() directly without checking it. If fm10k_register_pci_driver() fails, fm10k_init_module() returns without removing debugfs and destroying the workqueue, so the fm10k debugfs directory can never be created later and the workqueue is leaked.

fm10k_init_module()
  alloc_workqueue()
  fm10k_dbg_init() # create debugfs
  fm10k_register_pci_driver()
    pci_register_driver()
      driver_register()
        bus_add_driver()
          priv = kzalloc(...) # OOM happened
  # return without remove debugfs and destroy workqueue

Fix by removing debugfs and destroying the workqueue when fm10k_register_pci_driver() returns an error. Fixes: 7461fd913afe ("fm10k: Add support for debugfs") Fixes: b382bb1b3e2d ("fm10k: use separate workqueue for fm10k driver") Signed-off-by: Yuan Can <yuancan@huawei.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
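The unwinding this commit describes can be modelled in user space. The sketch below uses hypothetical stubs (`fake_alloc_workqueue`, `fake_dbg_init`, `fake_register_driver`, and so on) in place of the kernel calls; it only illustrates the "undo everything set up so far when a later step fails" shape and is not the fm10k code.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical resource state, standing in for a workqueue and debugfs dir. */
static bool wq_alive;
static bool dbg_alive;
static bool register_should_fail;

static int fake_alloc_workqueue(void)    { wq_alive = true; return 0; }
static void fake_destroy_workqueue(void) { wq_alive = false; }
static void fake_dbg_init(void)          { dbg_alive = true; }
static void fake_dbg_exit(void)          { dbg_alive = false; }

/* Simulates driver registration failing, e.g. on an out-of-memory condition. */
static int fake_register_driver(void)
{
    return register_should_fail ? -ENOMEM : 0;
}

static int demo_init_module(void)
{
    int ret;

    if (fake_alloc_workqueue())
        return -ENOMEM;
    fake_dbg_init();

    ret = fake_register_driver();
    if (ret) {
        /* Roll back in reverse order of setup before reporting the error. */
        fake_dbg_exit();
        fake_destroy_workqueue();
        return ret;
    }
    return 0;
}
```

Without the rollback branch, a failed registration would leave both resources behind, and a second load attempt would find the debugfs directory already present.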
2022-11-23i40e: Fix error handling in i40e_init_module()Shang XiaoJing
i40e_init_module() does not free the debugfs directory created by i40e_dbg_init() when pci_register_driver() fails. Add a failure path that calls i40e_dbg_exit() to remove the debugfs entries and prevent errors such as: i40e: Intel(R) Ethernet Connection XL710 Network Driver i40e: Copyright (c) 2013 - 2019 Intel Corporation. debugfs: Directory 'i40e' with parent '/' already present! Fixes: 41c445ff0f48 ("i40e: main driver core") Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-11-23ixgbevf: Fix resource leak in ixgbevf_init_module()Shang XiaoJing
ixgbevf_init_module() does not destroy the workqueue created by create_singlethread_workqueue() when pci_register_driver() fails. Add destroy_workqueue() to the failure path to prevent the resource leak. This is similar to the handling of u132_hcd_init in commit f276e002793c ("usb: u132-hcd: fix resource leak"). Fixes: 40a13e2493c9 ("ixgbevf: Use a private workqueue to avoid certain possible hangs") Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com> Reviewed-by: Saeed Mahameed <saeed@kernel.org> Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-11-23Merge tag 'asoc-fix-v6.1-rc6' of ↵Takashi Iwai
https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus ASoC: Fixes for v6.1 A clutch of small fixes that have come in in the past week, people seem to have been unusually active for this late in the release cycle. The most critical one here is the fix to renumber the SOF DAI types in order to restore ABI compatibility which was broken by the addition of AMD support.
2022-11-23btrfs: sysfs: normalize the error handling branch in btrfs_init_sysfs()Zhen Lei
Although kset_unregister() can eventually remove all attribute files, explicitly rolling back with the matching function makes the code logic look clearer. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-23btrfs: do not modify log tree while holding a leaf from fs tree lockedFilipe Manana
When logging an inode in full mode, or when logging xattrs or when logging the dir index items of a directory, we are modifying the log tree while holding a read lock on a leaf from the fs/subvolume tree. This can lead to a deadlock in rare circumstances, but it is a real possibility, and it was recently reported by syzbot with the following trace from lockdep: WARNING: possible circular locking dependency detected 6.1.0-rc5-next-20221116-syzkaller #0 Not tainted ------------------------------------------------------ syz-executor.1/16154 is trying to acquire lock: ffff88807e3084a0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0xa1/0xf30 fs/btrfs/delayed-inode.c:256 but task is already holding lock: ffff88807df33078 (btrfs-log-00){++++}-{3:3}, at: __btrfs_tree_lock+0x32/0x3d0 fs/btrfs/locking.c:197 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (btrfs-log-00){++++}-{3:3}: down_read_nested+0x9e/0x450 kernel/locking/rwsem.c:1634 __btrfs_tree_read_lock+0x32/0x350 fs/btrfs/locking.c:135 btrfs_tree_read_lock fs/btrfs/locking.c:141 [inline] btrfs_read_lock_root_node+0x82/0x3a0 fs/btrfs/locking.c:280 btrfs_search_slot_get_root fs/btrfs/ctree.c:1678 [inline] btrfs_search_slot+0x3ca/0x2c70 fs/btrfs/ctree.c:1998 btrfs_lookup_csum+0x116/0x3f0 fs/btrfs/file-item.c:209 btrfs_csum_file_blocks+0x40e/0x1370 fs/btrfs/file-item.c:1021 log_csums.isra.0+0x244/0x2d0 fs/btrfs/tree-log.c:4258 copy_items.isra.0+0xbfb/0xed0 fs/btrfs/tree-log.c:4403 copy_inode_items_to_log+0x13d6/0x1d90 fs/btrfs/tree-log.c:5873 btrfs_log_inode+0xb19/0x4680 fs/btrfs/tree-log.c:6495 btrfs_log_inode_parent+0x890/0x2a20 fs/btrfs/tree-log.c:6982 btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7083 btrfs_sync_file+0xa41/0x13c0 fs/btrfs/file.c:1921 vfs_fsync_range+0x13e/0x230 fs/sync.c:188 generic_write_sync include/linux/fs.h:2856 [inline] iomap_dio_complete+0x73a/0x920 fs/iomap/direct-io.c:128 btrfs_direct_write 
fs/btrfs/file.c:1536 [inline] btrfs_do_write_iter+0xba2/0x1470 fs/btrfs/file.c:1668 call_write_iter include/linux/fs.h:2160 [inline] do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735 do_iter_write+0x182/0x700 fs/read_write.c:861 vfs_iter_write+0x74/0xa0 fs/read_write.c:902 iter_file_splice_write+0x745/0xc90 fs/splice.c:686 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x114/0x180 fs/splice.c:931 splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886 do_splice_direct+0x1ab/0x280 fs/splice.c:974 do_sendfile+0xb19/0x1270 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x259/0x2c0 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd -> #1 (btrfs-tree-00){++++}-{3:3}: __lock_release kernel/locking/lockdep.c:5382 [inline] lock_release+0x371/0x810 kernel/locking/lockdep.c:5688 up_write+0x2a/0x520 kernel/locking/rwsem.c:1614 btrfs_tree_unlock_rw fs/btrfs/locking.h:189 [inline] btrfs_unlock_up_safe+0x1e3/0x290 fs/btrfs/locking.c:238 search_leaf fs/btrfs/ctree.c:1832 [inline] btrfs_search_slot+0x265e/0x2c70 fs/btrfs/ctree.c:2074 btrfs_insert_empty_items+0xbd/0x1c0 fs/btrfs/ctree.c:4133 btrfs_insert_delayed_item+0x826/0xfa0 fs/btrfs/delayed-inode.c:746 btrfs_insert_delayed_items fs/btrfs/delayed-inode.c:824 [inline] __btrfs_commit_inode_delayed_items fs/btrfs/delayed-inode.c:1111 [inline] __btrfs_run_delayed_items+0x280/0x590 fs/btrfs/delayed-inode.c:1153 flush_space+0x147/0xe90 fs/btrfs/space-info.c:728 btrfs_async_reclaim_metadata_space+0x541/0xc10 fs/btrfs/space-info.c:1086 process_one_work+0x9bf/0x1710 kernel/workqueue.c:2289 worker_thread+0x669/0x1090 kernel/workqueue.c:2436 kthread+0x2e8/0x3a0 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308 -> #0 (&delayed_node->mutex){+.+.}-{3:3}: check_prev_add kernel/locking/lockdep.c:3097 
[inline] check_prevs_add kernel/locking/lockdep.c:3216 [inline] validate_chain kernel/locking/lockdep.c:3831 [inline] __lock_acquire+0x2a43/0x56d0 kernel/locking/lockdep.c:5055 lock_acquire kernel/locking/lockdep.c:5668 [inline] lock_acquire+0x1e3/0x630 kernel/locking/lockdep.c:5633 __mutex_lock_common kernel/locking/mutex.c:603 [inline] __mutex_lock+0x12f/0x1360 kernel/locking/mutex.c:747 __btrfs_release_delayed_node.part.0+0xa1/0xf30 fs/btrfs/delayed-inode.c:256 __btrfs_release_delayed_node fs/btrfs/delayed-inode.c:251 [inline] btrfs_release_delayed_node fs/btrfs/delayed-inode.c:281 [inline] btrfs_remove_delayed_node+0x52/0x60 fs/btrfs/delayed-inode.c:1285 btrfs_evict_inode+0x511/0xf30 fs/btrfs/inode.c:5554 evict+0x2ed/0x6b0 fs/inode.c:664 dispose_list+0x117/0x1e0 fs/inode.c:697 prune_icache_sb+0xeb/0x150 fs/inode.c:896 super_cache_scan+0x391/0x590 fs/super.c:106 do_shrink_slab+0x464/0xce0 mm/vmscan.c:843 shrink_slab_memcg mm/vmscan.c:912 [inline] shrink_slab+0x388/0x660 mm/vmscan.c:991 shrink_node_memcgs mm/vmscan.c:6088 [inline] shrink_node+0x93d/0x1f30 mm/vmscan.c:6117 shrink_zones mm/vmscan.c:6355 [inline] do_try_to_free_pages+0x3b4/0x17a0 mm/vmscan.c:6417 try_to_free_mem_cgroup_pages+0x3a4/0xa70 mm/vmscan.c:6732 reclaim_high.constprop.0+0x182/0x230 mm/memcontrol.c:2393 mem_cgroup_handle_over_high+0x190/0x520 mm/memcontrol.c:2578 try_charge_memcg+0xe0c/0x12f0 mm/memcontrol.c:2816 try_charge mm/memcontrol.c:2827 [inline] charge_memcg+0x90/0x3b0 mm/memcontrol.c:6889 __mem_cgroup_charge+0x2b/0x90 mm/memcontrol.c:6910 mem_cgroup_charge include/linux/memcontrol.h:667 [inline] __filemap_add_folio+0x615/0xf80 mm/filemap.c:852 filemap_add_folio+0xaf/0x1e0 mm/filemap.c:934 __filemap_get_folio+0x389/0xd80 mm/filemap.c:1976 pagecache_get_page+0x2e/0x280 mm/folio-compat.c:104 find_or_create_page include/linux/pagemap.h:612 [inline] alloc_extent_buffer+0x2b9/0x1580 fs/btrfs/extent_io.c:4588 btrfs_init_new_buffer fs/btrfs/extent-tree.c:4869 [inline] 
btrfs_alloc_tree_block+0x2e1/0x1320 fs/btrfs/extent-tree.c:4988 __btrfs_cow_block+0x3b2/0x1420 fs/btrfs/ctree.c:440 btrfs_cow_block+0x2fa/0x950 fs/btrfs/ctree.c:595 btrfs_search_slot+0x11b0/0x2c70 fs/btrfs/ctree.c:2038 btrfs_update_root+0xdb/0x630 fs/btrfs/root-tree.c:137 update_log_root fs/btrfs/tree-log.c:2841 [inline] btrfs_sync_log+0xbfb/0x2870 fs/btrfs/tree-log.c:3064 btrfs_sync_file+0xdb9/0x13c0 fs/btrfs/file.c:1947 vfs_fsync_range+0x13e/0x230 fs/sync.c:188 generic_write_sync include/linux/fs.h:2856 [inline] iomap_dio_complete+0x73a/0x920 fs/iomap/direct-io.c:128 btrfs_direct_write fs/btrfs/file.c:1536 [inline] btrfs_do_write_iter+0xba2/0x1470 fs/btrfs/file.c:1668 call_write_iter include/linux/fs.h:2160 [inline] do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735 do_iter_write+0x182/0x700 fs/read_write.c:861 vfs_iter_write+0x74/0xa0 fs/read_write.c:902 iter_file_splice_write+0x745/0xc90 fs/splice.c:686 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x114/0x180 fs/splice.c:931 splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886 do_splice_direct+0x1ab/0x280 fs/splice.c:974 do_sendfile+0xb19/0x1270 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x259/0x2c0 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd other info that might help us debug this: Chain exists of: &delayed_node->mutex --> btrfs-tree-00 --> btrfs-log-00 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-log-00); lock(btrfs-tree-00); lock(btrfs-log-00); lock(&delayed_node->mutex); Holding a read lock on a leaf from a fs/subvolume tree creates a nasty lock dependency when we are COWing extent buffers for the log tree and we have two tasks modifying the log tree, with each one in one of the following 2 scenarios: 1) Modifying the log tree triggers an extent buffer 
allocation while holding a write lock on a parent extent buffer from the log tree. Allocating the pages for an extent buffer, or the extent buffer struct, can trigger inode eviction and finally the inode eviction will trigger a release/remove of a delayed node, which requires taking the delayed node's mutex; 2) Allocating a metadata extent for a log tree can trigger the async reclaim thread and make us wait for it to release enough space and unblock our reservation ticket. The reclaim thread can start flushing delayed items, and that in turn results in the need to lock delayed node mutexes and in the need to write lock extent buffers of a subvolume tree - all this while holding a write lock on the parent extent buffer in the log tree. So one task in scenario 1) running in parallel with another task in scenario 2) could lead to a deadlock, one wanting to lock a delayed node mutex while having a read lock on a leaf from the subvolume, while the other is holding the delayed node's mutex and wants to write lock the same subvolume leaf for flushing delayed items. Fix this by cloning the leaf of the fs/subvolume tree, releasing/unlocking the fs/subvolume leaf, and using the cloned leaf instead. Reported-by: syzbot+9b7c21f486f5e7f8d029@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/000000000000ccc93c05edc4d8cf@google.com/ CC: stable@vger.kernel.org # 6.0+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-23btrfs: use kvcalloc in btrfs_get_dev_zone_infoChristoph Hellwig
Otherwise the kernel memory allocator seems to be unhappy about failing order 6 allocations for the zones array, which cause 100% reproducible mount failures in my qemu setup:

[26.078981] mount: page allocation failure: order:6, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
[26.079741] CPU: 0 PID: 2965 Comm: mount Not tainted 6.1.0-rc5+ #185
[26.080181] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[26.080950] Call Trace:
[26.081132] <TASK>
[26.081291] dump_stack_lvl+0x56/0x6f
[26.081554] warn_alloc+0x117/0x140
[26.081808] ? __alloc_pages_direct_compact+0x1b5/0x300
[26.082174] __alloc_pages_slowpath.constprop.0+0xd0e/0xde0
[26.082569] __alloc_pages+0x32a/0x340
[26.082836] __kmalloc_large_node+0x4d/0xa0
[26.083133] ? trace_kmalloc+0x29/0xd0
[26.083399] kmalloc_large+0x14/0x60
[26.083654] btrfs_get_dev_zone_info+0x1b9/0xc00
[26.083980] ? _raw_spin_unlock_irqrestore+0x28/0x50
[26.084328] btrfs_get_dev_zone_info_all_devices+0x54/0x80
[26.084708] open_ctree+0xed4/0x1654
[26.084974] btrfs_mount_root.cold+0x12/0xde
[26.085288] ? lock_is_held_type+0xe2/0x140
[26.085603] legacy_get_tree+0x28/0x50
[26.085876] vfs_get_tree+0x1d/0xb0
[26.086139] vfs_kern_mount.part.0+0x6c/0xb0
[26.086456] btrfs_mount+0x118/0x3a0
[26.086728] ? lock_is_held_type+0xe2/0x140
[26.087043] legacy_get_tree+0x28/0x50
[26.087323] vfs_get_tree+0x1d/0xb0
[26.087587] path_mount+0x2ba/0xbe0
[26.087850] ? _raw_spin_unlock_irqrestore+0x38/0x50
[26.088217] __x64_sys_mount+0xfe/0x140
[26.088506] do_syscall_64+0x35/0x80
[26.088776] entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: 5b316468983d ("btrfs: get zone information of zoned block devices") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
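For a sense of scale, an order-N page allocation asks for 2^N physically contiguous pages, so order 6 means 64 pages in one piece. The helper below is a user-space sketch of that arithmetic with a hypothetical name (`demo_alloc_order`) and an assumed 4096-byte page size passed as a parameter; it is not the kernel's own order calculation.

```c
#include <stddef.h>

/*
 * Smallest order such that (page_size << order) >= bytes.
 * page_size is an assumption supplied by the caller (4096 in the usage note).
 */
static unsigned int demo_alloc_order(size_t bytes, size_t page_size)
{
    unsigned int order = 0;
    size_t span = page_size;

    while (span < bytes) {
        span <<= 1;
        order++;
    }
    return order;
}
```

Under the 4096-byte assumption, any request larger than 128 KiB and up to 256 KiB lands at order 6. A kvcalloc()-style allocation can fall back to virtually contiguous memory when such a large physically contiguous block is not available, which is why the switch avoids the failure shown in the trace.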