summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-07-11Merge branch 'mptcp-fixes'David S. Miller
Mat Martineau says: ==================== mptcp: Disconnect and selftest fixes Patch 1 switches to a safe list iterator in the MPTCP disconnect code. Patch 2 adds the userspace_pm.sh selftest script to the MPTCP selftest Makefile, resolving the netdev/check_selftest CI failure. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-11selftests: mptcp: validate userspace PM tests by defaultMatthieu Baerts
The new script was not listed in the programs to test. By consequence, some CIs running MPTCP selftests were not validating these new tests. Note that MPTCP CI was validating it as it executes all .sh scripts from 'tools/testing/selftests/net/mptcp' directory. Fixes: 259a834fadda ("selftests: mptcp: functional tests for the userspace PM type") Reported-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-11mptcp: fix subflow traversal at disconnect timePaolo Abeni
At disconnect time the MPTCP protocol traverse the subflows list closing each of them. In some circumstances - MPJ subflow, passive MPTCP socket, the latter operation can remove the subflow from the list, invalidating the current iterator. Address the issue using the safe list traversing helper variant. Reported-by: van fantasy <g1042620637@gmail.com> Fixes: b29fcfb54cd7 ("mptcp: full disconnect implementation") Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-11wifi: mac80211: fix queue selection for mesh/OCB interfacesFelix Fietkau
When using iTXQ, the code assumes that there is only one vif queue for broadcast packets, using the BE queue. Allowing non-BE queue marking violates that assumption and txq->ac == skb_queue_mapping is no longer guaranteed. This can cause issues with queue handling in the driver and also causes issues with the recent ATF change, resulting in an AQL underflow warning. Cc: stable@vger.kernel.org Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20220702145227.39356-1-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2022-07-10Linux 5.19-rc6v5.19-rc6Linus Torvalds
2022-07-10Merge branch 'hot-fixes' (fixes for rc6)Linus Torvalds
This is a collection of three fixes for small annoyances. Two of these are already pending in other trees, but I really don't want to release another -rc with these issues pending, so I picked up the patches for these things directly. We'll end up with duplicate commits eventually, I prefer that over having these issues pending. The third one is just me getting rid of another BUG_ON() just because it was reported and I dislike those things so much. * merge 'hot-fixes' branch: ida: don't use BUG_ON() for debugging drm/aperture: Run fbdev removal before internal helpers ptrace: fix clearing of JOBCTL_TRACED in ptrace_unfreeze_traced()
2022-07-10ida: don't use BUG_ON() for debuggingLinus Torvalds
This is another old BUG_ON() that just shouldn't exist (see also commit a382f8fee42c: "signal handling: don't use BUG_ON() for debugging"). In fact, as Matthew Wilcox points out, this condition shouldn't really even result in a warning, since a negative id allocation result is just a normal allocation failure: "I wonder if we should even warn here -- sure, the caller is trying to free something that wasn't allocated, but we don't warn for kfree(NULL)" and goes on to point out how that current error check is only causing people to unnecessarily do their own index range checking before freeing it. This was noted by Itay Iellin, because the bluetooth HCI socket cookie code does *not* do that range checking, and ends up just freeing the error case too, triggering the BUG_ON(). The HCI code requires CAP_NET_RAW, and seems to just result in an ugly splat, but there really is no reason to BUG_ON() here, and we have generally striven for allocation models where it's always ok to just do free(alloc()); even if the allocation were to fail for some random reason (usually obviously that "random" reason being some resource limit). Fixes: 88eca0207cf1 ("ida: simplified functions for id allocation") Reported-by: Itay Iellin <ieitayie@gmail.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-07-10Merge tag 'dmaengine-fix-5.19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine Pull dmaengine fixes from Vinod Koul: "One core fix for DMA_INTERRUPT and rest driver fixes. Core: - Revert verification of DMA_INTERRUPT capability as that was incorrect Bunch of driver fixes for: - ti: refcount and put_device leak - qcom_bam: runtime pm overflow - idxd: force wq context cleanup and call idxd_enable_system_pasid() on success - dw-axi-dmac: RMW on channel suspend register - imx-sdma: restart cyclic channel when enabled - at_xdma: error handling for at_xdmac_alloc_desc - pl330: lockdep warning - lgm: error handling path in probe - allwinner: Fix min/max typo in binding" * tag 'dmaengine-fix-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine: dt-bindings: dma: allwinner,sun50i-a64-dma: Fix min/max typo dmaengine: lgm: Fix an error handling path in intel_ldma_probe() dmaengine: pl330: Fix lockdep warning about non-static key dmaengine: idxd: Only call idxd_enable_system_pasid() if succeeded in enabling SVA feature dmaengine: at_xdma: handle errors of at_xdmac_alloc_desc() correctly dmaengine: imx-sdma: only restart cyclic channel when enabled dmaengine: dw-axi-dmac: Fix RMW on channel suspend register dmaengine: idxd: force wq context cleanup on device disable path dmaengine: qcom: bam_dma: fix runtime PM underflow dmaengine: imx-sdma: Allow imx8m for imx7 FW revs dmaengine: Revert "dmaengine: add verification of DMA_INTERRUPT capability for dmatest" dmaengine: ti: Add missing put_device in ti_dra7_xbar_route_allocate dmaengine: ti: Fix refcount leak in ti_dra7_xbar_route_allocate
2022-07-10Merge tag 'staging-5.19-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging Pull staging driver fix from Greg KH: "Here is a single staging driver fix for a reported problem that showed up in 5.19-rc1 in the wlan-ng driver. It has been in linux-next for a week with no reported problems" * tag 'staging-5.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: staging/wlan-ng: get the correct struct hfa384x in work callback
2022-07-10Merge tag 'char-misc-5.19-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver fixes from Greg KH: "Here are four small char/misc driver fixes for 5.19-rc6 to resolve some reported issues. They only affect two drivers: - rtsx_usb: fix for of-reported DMA warning error, the driver was handling memory buffers in odd ways, it has now been fixed up to be much simpler and correct by Shuah. - at25 eeprom driver bugfix for reported problem All of these have been in linux-next for a week with no reported problems" * tag 'char-misc-5.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: misc: rtsx_usb: set return value in rsp_buf alloc err path misc: rtsx_usb: use separate command and response buffers misc: rtsx_usb: fix use of dma mapped buffer for usb bulk transfer eeprom: at25: Rework buggy read splitting
2022-07-10Merge tag 'io_uring-5.19-2022-07-09' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring fix from Jens Axboe: "A single fix for an issue that came up yesterday that we should plug for -rc6. This is a regression introduced in this cycle" * tag 'io_uring-5.19-2022-07-09' of git://git.kernel.dk/linux-block: io_uring: check that we have a file table when allocating update slots
2022-07-10Merge tag 'kbuild-fixes-v5.19-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild fixes from Masahiro Yamada: - Adjust gen_compile_commands.py to the format change of *.mod files - Remove unused macro in scripts/Makefile.modinst * tag 'kbuild-fixes-v5.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: kbuild: remove unused cmd_none in scripts/Makefile.modinst gen_compile_commands: handle multiple lines per .mod file
2022-07-10Merge tag 'irq_urgent_for_v5.19_rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Borislav Petkov: - Gracefully handle failure to request MMIO resources in the GICv3 driver - Make a static key static in the Apple AIC driver - Fix the Xilinx intc driver dependency on OF_ADDRESS * tag 'irq_urgent_for_v5.19_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/apple-aic: Make symbol 'use_fast_ipi' static irqchip/xilinx: Add explicit dependency on OF_ADDRESS irqchip/gicv3: Handle resource request failure consistently
2022-07-10Merge tag 'x86_urgent_for_v5.19_rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Prepare for and clear .brk early in order to address XenPV guests failures where the hypervisor verifies page tables and uninitialized data in that range leads to bogus failures in those checks - Add any potential setup_data entries supplied at boot to the identity pagetable mappings to prevent kexec kernel boot failures. Usually, this is not a problem for the normal kernel as those mappings are part of the initially mapped 2M pages but if kexec gets to allocate the second kernel somewhere else, those setup_data entries need to be mapped there too. - Fix objtool not to discard text references from the __tracepoints section so that ENDBR validation still works - Correct the setup_data types limit as it is user-visible, before 5.19 releases * tag 'x86_urgent_for_v5.19_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/boot: Fix the setup data types max limit x86/ibt, objtool: Don't discard text references from tracepoint section x86/compressed/64: Add identity mappings for setup_data entries x86: Fix .brk attribute in linker script x86: Clear .brk area at early boot x86/xen: Use clear_bss() for Xen PV guests
2022-07-10kbuild: remove unused cmd_none in scripts/Makefile.modinstMasahiro Yamada
Commit 65ce9c38326e ("kbuild: move module strip/compression code into scripts/Makefile.modinst") added this unused code. Perhaps, I thought cmd_none was useful for CONFIG_MODULE_COMPRESS_NONE, but I did not use it after all. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
2022-07-10x86/boot: Fix the setup data types max limitBorislav Petkov
Commit in Fixes forgot to change the SETUP_TYPE_MAX definition which contains the highest valid setup data type. Correct that. Fixes: 5ea98e01ab52 ("x86/boot: Add Confidential Computing type to setup_data") Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/ddba81dd-cc92-699c-5274-785396a17fb5@zytor.com
2022-07-09Merge tag 'i2c-for-5.19-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c fixes from Wolfram Sang: "Two I2C driver bugfixes preventing resource leaks" * tag 'i2c-for-5.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: i2c: cadence: Unregister the clk notifier in error path i2c: piix4: Fix a memory leak in the EFCH MMIO support
2022-07-09drm/aperture: Run fbdev removal before internal helpersThomas Zimmermann
Always run fbdev removal first to remove simpledrm via sysfb_disable(). This clears the internal state. The later call to drm_aperture_detach_drivers() then does nothing. Otherwise, with drm_aperture_detach_drivers() running first, the call to sysfb_disable() uses inconsistent state. Example backtrace show below: BUG: KASAN: use-after-free in device_del+0x79/0x5f0 Read of size 8 at addr ffff888108185050 by task systemd-udevd/311 CPU: 0 PID: 311 Comm: systemd-udevd Tainted: G E 5.19.0-rc2-1-default+ #1689 Hardware name: HP ProLiant DL120 G7, BIOS J01 04/21/2011 Call Trace: device_del+0x79/0x5f0 platform_device_del.part.0+0x19/0xe0 platform_device_unregister+0x1c/0x30 sysfb_disable+0x2d/0x70 remove_conflicting_framebuffers+0x1c/0xf0 remove_conflicting_pci_framebuffers+0x130/0x1a0 drm_aperture_remove_conflicting_pci_framebuffers+0x86/0xb0 mgag200_pci_probe+0x2d/0x140 [mgag200] Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Fixes: 873eb3b11860 ("fbdev: Disable sysfb device registration when removing conflicting FBs") Cc: Javier Martinez Canillas <javierm@redhat.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Helge Deller <deller@gmx.de> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Zhen Lei <thunder.leizhen@huawei.com> Cc: Changcheng Deng <deng.changcheng@zte.com.cn> Reviewed-by: Zack Rusin <zackr@vmware.com> Reviewed-by: Javier Martinez Canillas <javierm@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-07-09ptrace: fix clearing of JOBCTL_TRACED in ptrace_unfreeze_traced()Sven Schnelle
CI reported the following splat while running the strace testsuite: WARNING: CPU: 1 PID: 3570031 at kernel/ptrace.c:272 ptrace_check_attach+0x12e/0x178 CPU: 1 PID: 3570031 Comm: strace Tainted: G OE 5.19.0-20220624.rc3.git0.ee819a77d4e7.300.fc36.s390x #1 Hardware name: IBM 3906 M04 704 (z/VM 7.1.0) Call Trace: [<00000000ab4b645a>] ptrace_check_attach+0x132/0x178 ([<00000000ab4b6450>] ptrace_check_attach+0x128/0x178) [<00000000ab4b6cde>] __s390x_sys_ptrace+0x86/0x160 [<00000000ac03fcec>] __do_syscall+0x1d4/0x200 [<00000000ac04e312>] system_call+0x82/0xb0 Last Breaking-Event-Address: [<00000000ab4ea3c8>] wait_task_inactive+0x98/0x190 This is because JOBCTL_TRACED is set, but the task is not in TASK_TRACED state. Caused by ptrace_unfreeze_traced() which does: task->jobctl &= ~TASK_TRACED but it should be: task->jobctl &= ~JOBCTL_TRACED Fixes: 31cae1eaae4f ("sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state") Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Tested-by: Alexander Gordeev <agordeev@linux.ibm.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-07-09Merge tag 'powerpc-5.19-5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fix from Michael Ellerman: - On Power8 bare metal, fix creation of RNG platform devices, which are needed for the /dev/hwrng driver to probe correctly. Thanks to Jason A. Donenfeld, and Sachin Sant. * tag 'powerpc-5.19-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/powernv: delay rng platform device creation until later in boot
2022-07-09Merge tag 'asoc-fix-v5.19-rc4' of ↵Takashi Iwai
https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus ASoC: Fixes for v5.19 Quite a large batch due to things building up for a couple of weeks but all driver specific apart from Marek's documentation fix.
2022-07-09netfilter: nf_tables: replace BUG_ON by element length checkPablo Neira Ayuso
BUG_ON can be triggered from userspace with an element with a large userdata area. Replace it by length check and return EINVAL instead. Over time extensions have been growing in size. Pick a sufficiently old Fixes: tag to propagate this fix. Fixes: 7d7402642eaf ("netfilter: nf_tables: variable sized set element keys / data") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-07-09io_uring: check that we have a file table when allocating update slotsJens Axboe
If IORING_FILE_INDEX_ALLOC is set asking for an allocated slot, the helper doesn't check if we actually have a file table or not. The non alloc path does do that correctly, and returns -ENXIO if we haven't set one up. Do the same for the allocated path, avoiding a NULL pointer dereference when trying to find a free bit. Fixes: a7c41b4687f5 ("io_uring: let IORING_OP_FILES_UPDATE support choosing fixed file slots") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-09vlan: fix memory leak in vlan_newlink()Eric Dumazet
Blamed commit added back a bug I fixed in commit 9bbd917e0bec ("vlan: fix memory leak in vlan_dev_set_egress_priority") If a memory allocation fails in vlan_changelink() after other allocations succeeded, we need to call vlan_dev_free_egress_priority() to free all allocated memory because after a failed ->newlink() we do not call any methods like ndo_uninit() or dev->priv_destructor(). In following example, if the allocation for last element 2000:2001 fails, we need to free eight prior allocations: ip link add link dummy0 dummy0.100 type vlan id 100 \ egress-qos-map 1:2 2:3 3:4 4:5 5:6 6:7 7:8 8:9 2000:2001 syzbot report was: BUG: memory leak unreferenced object 0xffff888117bd1060 (size 32): comm "syz-executor408", pid 3759, jiffies 4294956555 (age 34.090s) hex dump (first 32 bytes): 09 00 00 00 00 a0 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff83fc60ad>] kmalloc include/linux/slab.h:600 [inline] [<ffffffff83fc60ad>] vlan_dev_set_egress_priority+0xed/0x170 net/8021q/vlan_dev.c:193 [<ffffffff83fc6628>] vlan_changelink+0x178/0x1d0 net/8021q/vlan_netlink.c:128 [<ffffffff83fc67c8>] vlan_newlink+0x148/0x260 net/8021q/vlan_netlink.c:185 [<ffffffff838b1278>] rtnl_newlink_create net/core/rtnetlink.c:3363 [inline] [<ffffffff838b1278>] __rtnl_newlink+0xa58/0xdc0 net/core/rtnetlink.c:3580 [<ffffffff838b1629>] rtnl_newlink+0x49/0x70 net/core/rtnetlink.c:3593 [<ffffffff838ac66c>] rtnetlink_rcv_msg+0x21c/0x5c0 net/core/rtnetlink.c:6089 [<ffffffff839f9c37>] netlink_rcv_skb+0x87/0x1d0 net/netlink/af_netlink.c:2501 [<ffffffff839f8da7>] netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline] [<ffffffff839f8da7>] netlink_unicast+0x397/0x4c0 net/netlink/af_netlink.c:1345 [<ffffffff839f9266>] netlink_sendmsg+0x396/0x710 net/netlink/af_netlink.c:1921 [<ffffffff8384dbf6>] sock_sendmsg_nosec net/socket.c:714 [inline] [<ffffffff8384dbf6>] sock_sendmsg+0x56/0x80 net/socket.c:734 [<ffffffff8384e15c>] ____sys_sendmsg+0x36c/0x390 net/socket.c:2488 [<ffffffff838523cb>] ___sys_sendmsg+0x8b/0xd0 net/socket.c:2542 [<ffffffff838525b8>] __sys_sendmsg net/socket.c:2571 [inline] [<ffffffff838525b8>] __do_sys_sendmsg net/socket.c:2580 [inline] [<ffffffff838525b8>] __se_sys_sendmsg net/socket.c:2578 [inline] [<ffffffff838525b8>] __x64_sys_sendmsg+0x78/0xf0 net/socket.c:2578 [<ffffffff845ad8d5>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<ffffffff845ad8d5>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 [<ffffffff8460006a>] entry_SYSCALL_64_after_hwframe+0x46/0xb0 Fixes: 37aa50c539bc ("vlan: introduce vlan_dev_free_egress_priority") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Xin Long <lucien.xin@gmail.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-09nfp: fix issue of skb segments exceeds descriptor limitationBaowen Zheng
TCP packets will be dropped if the segments number in the tx skb exceeds limitation when sending iperf3 traffic with --zerocopy option. we make the following changes: Get nr_frags in nfp_nfdk_tx_maybe_close_block instead of passing from outside because it will be changed after skb_linearize operation. Fill maximum dma_len in first tx descriptor to make sure the whole head is included in the first descriptor. Fixes: c10d12e3dce8 ("nfp: add support for NFDK data path") Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com> Reviewed-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-09x86/speculation: Disable RRSBA behaviorPawan Gupta
Some Intel processors may use alternate predictors for RETs on RSB-underflow. This condition may be vulnerable to Branch History Injection (BHI) and intramode-BTI. Kernel earlier added spectre_v2 mitigation modes (eIBRS+Retpolines, eIBRS+LFENCE, Retpolines) which protect indirect CALLs and JMPs against such attacks. However, on RSB-underflow, RET target prediction may fallback to alternate predictors. As a result, RET's predicted target may get influenced by branch history. A new MSR_IA32_SPEC_CTRL bit (RRSBA_DIS_S) controls this fallback behavior when in kernel mode. When set, RETs will not take predictions from alternate predictors, hence mitigating RETs as well. Support for this is enumerated by CPUID.7.2.EDX[RRSBA_CTRL] (bit2). For spectre v2 mitigation, when a user selects a mitigation that protects indirect CALLs and JMPs against BHI and intramode-BTI, set RRSBA_DIS_S also to protect RETs for RSB-underflow case. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de>
2022-07-09x86/kexec: Disable RET on kexecKonrad Rzeszutek Wilk
All the invocations unroll to __x86_return_thunk and this file must be PIC independent. This fixes kexec on 64-bit AMD boxes. [ bp: Fix 32-bit build. ] Reported-by: Edward Tran <edward.tran@oracle.com> Reported-by: Awais Tanveer <awais.tanveer@oracle.com> Suggested-by: Ankur Arora <ankur.a.arora@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com> Signed-off-by: Borislav Petkov <bp@suse.de>
2022-07-09netfilter: nf_log: incorrect offset to network headerPablo Neira Ayuso
NFPROTO_ARP is expecting to find the ARP header at the network offset. In the particular case of ARP, HTYPE= field shows the initial bytes of the ethernet header destination MAC address. netdev out: IN= OUT=bridge0 MACSRC=c2:76:e5:71:e1:de MACDST=36:b0:4a:e2:72:ea MACPROTO=0806 ARP HTYPE=14000 PTYPE=0x4ae2 OPCODE=49782 NFPROTO_NETDEV egress hook is also expecting to find the IP headers at the network offset. Fixes: 35b9395104d5 ("netfilter: add generic ARP packet logger") Reported-by: Tom Yan <tom.ty89@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-07-08Merge branch 'selftests-forwarding-install-two-missing-tests'Jakub Kicinski
Martin Blumenstingl says: ==================== selftests: forwarding: Install two missing tests For some distributions (e.g. OpenWrt) we don't want to rely on rsync to copy the tests to the target as some extra dependencies need to be installed. The Makefile in tools/testing/selftests/net/forwarding already installs most of the tests. This series adds the two missing tests to the list of installed tests. That way a downstream distribution can build a package using this Makefile (and add dependencies there as needed). ==================== Link: https://lore.kernel.org/r/20220707135532.1783925-1-martin.blumenstingl@googlemail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-08selftests: forwarding: Install no_forwarding.shMartin Blumenstingl
When using the Makefile from tools/testing/selftests/net/forwarding/ all tests should be installed. Add no_forwarding.sh to the list of "to be installed tests" where it has been missing so far. Fixes: 476a4f05d9b83f ("selftests: forwarding: add a no_forwarding.sh test") Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-08selftests: forwarding: Install local_termination.shMartin Blumenstingl
When using the Makefile from tools/testing/selftests/net/forwarding/ all tests should be installed. Add local_termination.sh to the list of "to be installed tests" where it has been missing so far. Fixes: 90b9566aa5cd3f ("selftests: forwarding: add a test for local_termination.sh") Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-08Merge tag 'fscache-fixes-20220708' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull fscache fixes from David Howells: - Fix a check in fscache_wait_on_volume_collision() in which the polarity is reversed. It should complain if a volume is still marked acquisition-pending after 20s, but instead complains if the mark has been cleared (ie. the condition has cleared). Also switch an open-coded test of the ACQUIRE_PENDING volume flag to use the helper function for consistency. - Not a fix per se, but neaten the code by using a helper to check for the DROPPED state. - Fix cachefiles's support for erofs to only flush requests associated with a released control file, not all requests. - Fix a race between one process invalidating an object in the cache and another process trying to look it up. * tag 'fscache-fixes-20220708' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: fscache: Fix invalidation/lookup race cachefiles: narrow the scope of flushed requests when releasing fd fscache: Introduce fscache_cookie_is_dropped() fscache: Fix if condition in fscache_wait_on_volume_collision()
2022-07-08Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski
Daniel Borkmann says: ==================== bpf 2022-07-08 We've added 3 non-merge commits during the last 2 day(s) which contain a total of 7 files changed, 40 insertions(+), 24 deletions(-). The main changes are: 1) Fix cBPF splat triggered by skb not having a mac header, from Eric Dumazet. 2) Fix spurious packet loss in generic XDP when pushing packets out (note that native XDP is not affected by the issue), from Johan Almbladh. 3) Fix bpf_dynptr_{read,write}() helper signatures with flag argument before its set in stone as UAPI, from Joanne Koong. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Add flags arg to bpf_dynptr_read and bpf_dynptr_write APIs bpf: Make sure mac_header was set before using it xdp: Fix spurious packet loss in generic XDP TX path ==================== Link: https://lore.kernel.org/r/20220708213418.19626-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-08ptrace: fix clearing of JOBCTL_TRACED in ptrace_unfreeze_traced()Sven Schnelle
CI reported the following splat while running the strace testsuite: [ 3976.640309] WARNING: CPU: 1 PID: 3570031 at kernel/ptrace.c:272 ptrace_check_attach+0x12e/0x178 [ 3976.640391] CPU: 1 PID: 3570031 Comm: strace Tainted: G OE 5.19.0-20220624.rc3.git0.ee819a77d4e7.300.fc36.s390x #1 [ 3976.640410] Hardware name: IBM 3906 M04 704 (z/VM 7.1.0) [ 3976.640452] Call Trace: [ 3976.640454] [<00000000ab4b645a>] ptrace_check_attach+0x132/0x178 [ 3976.640457] ([<00000000ab4b6450>] ptrace_check_attach+0x128/0x178) [ 3976.640460] [<00000000ab4b6cde>] __s390x_sys_ptrace+0x86/0x160 [ 3976.640463] [<00000000ac03fcec>] __do_syscall+0x1d4/0x200 [ 3976.640468] [<00000000ac04e312>] system_call+0x82/0xb0 [ 3976.640470] Last Breaking-Event-Address: [ 3976.640471] [<00000000ab4ea3c8>] wait_task_inactive+0x98/0x190 This is because JOBCTL_TRACED is set, but the task is not in TASK_TRACED state. Caused by ptrace_unfreeze_traced() which does: task->jobctl &= ~TASK_TRACED but it should be: task->jobctl &= ~JOBCTL_TRACED Fixes: 31cae1eaae4f ("sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state") Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Link: https://lkml.kernel.org/r/20220706101625.2100298-1-svens@linux.ibm.com Link: https://lkml.kernel.org/r/YrHA5UkJLornOdCz@li-4a3a4a4c-28e5-11b2-a85c-a8d192c6f089.ibm.com Link: https://bugzilla.redhat.com/show_bug.cgi?id=2101641 Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Alexander Gordeev <agordeev@linux.ibm.com> Tested-by: Linus Torvalds <torvalds@linuxfoundation.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2022-07-08Merge tag 'acpi-5.19-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI fixes from Rafael Wysocki: "These fix two recent regressions related to CPPC support. Specifics: - Prevent _CPC from being used if the platform firmware does not confirm CPPC v2 support via _OSC (Mario Limonciello) - Allow systems with X86_FEATURE_CPPC set to use _CPC even if CPPC support cannot be agreed on via _OSC (Mario Limonciello)" * tag 'acpi-5.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI: CPPC: Don't require _OSC if X86_FEATURE_CPPC is supported ACPI: CPPC: Only probe for _CPC if CPPC v2 is acked
2022-07-08Merge tag 'pm-5.19-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix a NULL pointer dereference in a devfreq driver and a runtime PM framework issue that may cause a supplier device to be suspended before its consumer. Specifics: - Fix NULL pointer dereference related to printing a diagnostic message in the exynos-bus devfreq driver (Christian Marangi) - Fix race condition in the runtime PM framework which in some cases may cause a supplier device to be suspended when its consumer is still active (Rafael Wysocki)" * tag 'pm-5.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM / devfreq: exynos-bus: Fix NULL pointer dereference PM: runtime: Fix supplier device management during consumer probe PM: runtime: Redefine pm_runtime_release_supplier()
2022-07-08Merge tag 'cxl-fixes-for-5.19-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl Pull cxl fixes from Vishal Verma: - Update MAINTAINERS for Ben's email - Fix cleanup of port devices on failure to probe driver - Fix endianness in get/set LSA mailbox command structures - Fix memregion_free() fallback definition - Fix missing variable payload checks in CXL cmd size validation * tag 'cxl-fixes-for-5.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: cxl/mbox: Fix missing variable payload checks in cmd size validation memregion: Fix memregion_free() fallback definition cxl/mbox: Use __le32 in get,set_lsa mailbox structures cxl/core: Use is_endpoint_decoder cxl: Fix cleanup of port devices on failure to probe driver. MAINTAINERS: Update Ben's email address
2022-07-08Merge tag 'iommu-fixes-v5.19-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull iommu fixes from Joerg Roedel: - fix device setup failures in the Intel VT-d driver when the PASID table is shared - fix Intel VT-d device hot-add failure due to wrong device notifier order - remove the old IOMMU mailing list from the MAINTAINERS file now that it has been retired * tag 'iommu-fixes-v5.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: MAINTAINERS: Remove iommu@lists.linux-foundation.org iommu/vt-d: Fix RID2PASID setup/teardown failure iommu/vt-d: Fix PCI bus rescan device hot add
2022-07-08Merge tag 'gpio-fixes-for-v5.19-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux Pull gpio fixes from Bartosz Golaszewski: - fix a build error in gpio-vf610 - fix a null-pointer dereference in the GPIO character device code * tag 'gpio-fixes-for-v5.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux: gpiolib: cdev: fix null pointer dereference in linereq_free() gpio: vf610: fix compilation error
2022-07-08Merge branch 'pm-core'Rafael J. Wysocki
Merge a runtime PM framework cleanup and fix related to device links. * pm-core: PM: runtime: Fix supplier device management during consumer probe PM: runtime: Redefine pm_runtime_release_supplier()
2022-07-08Merge tag 'block-5.19-2022-07-08' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "NVMe pull request with another id quirk addition, and a tracing fix" * tag 'block-5.19-2022-07-08' of git://git.kernel.dk/linux-block: nvme: use struct group for generic command dwords nvme-pci: phison e16 has bogus namespace ids
2022-07-08Merge tag 'io_uring-5.19-2022-07-08' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring tweak from Jens Axboe: "Just a minor tweak to an addition made in this release cycle: padding a 32-bit value that's in a 64-bit union to avoid any potential funkiness from that" * tag 'io_uring-5.19-2022-07-08' of git://git.kernel.dk/linux-block: io_uring: explicit sqe padding for ioctl commands
2022-07-08Merge tag 'for-5.19/fbdev-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev Pull fbdev fixes from Helge Deller: - fbcon now prevents switching to screen resolutions which are smaller than the font size, and prevents enabling a font which is bigger than the current screen resolution. This fixes vmalloc-out-of-bounds accesses found by KASAN. - Guiling Deng fixed a bug where the centered fbdev logo wasn't displayed correctly if the screen size matched the logo size. - Hsin-Yi Wang provided a patch to include errno.h to fix build when CONFIG_OF isn't enabled. * tag 'for-5.19/fbdev-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev: fbcon: Use fbcon_info_from_console() in fbcon_modechange_possible() fbmem: Check virtual screen sizes in fb_set_var() fbcon: Prevent that screen size is smaller than font size fbcon: Disallow setting font bigger than screen size video: of_display_timing.h: include errno.h fbdev: fbmem: Fix logo center image dx issue
2022-07-08btrfs: zoned: drop optimization of zone finishNaohiro Aota
We have an optimization in do_zone_finish() to send REQ_OP_ZONE_FINISH only when necessary, i.e. we don't send REQ_OP_ZONE_FINISH when we assume we wrote fully into the zone. The assumption is determined by "alloc_offset == capacity". This condition won't work if the last ordered extent is canceled due to some errors. In that case, we consider the zone is deactivated without sending the finish command while it's still active. This inconstancy results in activating another block group while we cannot really activate the underlying zone, which causes the active zone exceeds errors like below. BTRFS error (device nvme3n2): allocation failed flags 1, wanted 520192 tree-log 0, relocation: 0 nvme3n2: I/O Cmd(0x7d) @ LBA 160432128, 127 blocks, I/O Error (sct 0x1 / sc 0xbd) MORE DNR active zones exceeded error, dev nvme3n2, sector 0 op 0xd:(ZONE_APPEND) flags 0x4800 phys_seg 1 prio class 0 nvme3n2: I/O Cmd(0x7d) @ LBA 160432128, 127 blocks, I/O Error (sct 0x1 / sc 0xbd) MORE DNR active zones exceeded error, dev nvme3n2, sector 0 op 0xd:(ZONE_APPEND) flags 0x4800 phys_seg 1 prio class 0 Fix the issue by removing the optimization for now. Fixes: 8376d9e1ed8f ("btrfs: zoned: finish superblock zone once no space left for new SB") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-08btrfs: zoned: fix a leaked bioc in read_zone_infoChristoph Hellwig
The bioc would leak on the normal completion path and also on the RAID56 check (but that one won't happen in practice due to the invalid combination with zoned mode). Fixes: 7db1c5d14dcd ("btrfs: zoned: support dev-replace in zoned filesystems") CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> [ update changelog ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-08btrfs: return -EAGAIN for NOWAIT dio reads/writes on compressed and inline ↵Filipe Manana
extents When doing a direct IO read or write, we always return -ENOTBLK when we find a compressed extent (or an inline extent) so that we fallback to buffered IO. This however is not ideal in case we are in a NOWAIT context (io_uring for example), because buffered IO can block and we currently have no support for NOWAIT semantics for buffered IO, so if we need to fallback to buffered IO we should first signal the caller that we may need to block by returning -EAGAIN instead. This behaviour can also result in short reads being returned to user space, which although it's not incorrect and user space should be able to deal with partial reads, it's somewhat surprising and even some popular applications like QEMU (Link tag #1) and MariaDB (Link tag #2) don't deal with short reads properly (or at all). The short read case happens when we try to read from a range that has a non-compressed and non-inline extent followed by a compressed extent. After having read the first extent, when we find the compressed extent we return -ENOTBLK from btrfs_dio_iomap_begin(), which results in iomap to treat the request as a short read, returning 0 (success) and waiting for previously submitted bios to complete (this happens at fs/iomap/direct-io.c:__iomap_dio_rw()). After that, and while at btrfs_file_read_iter(), we call filemap_read() to use buffered IO to read the remaining data, and pass it the number of bytes we were able to read with direct IO. Than at filemap_read() if we get a page fault error when accessing the read buffer, we return a partial read instead of an -EFAULT error, because the number of bytes previously read is greater than zero. So fix this by returning -EAGAIN for NOWAIT direct IO when we find a compressed or an inline extent. Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com> Link: https://lore.kernel.org/linux-btrfs/YrrFGO4A1jS0GI0G@atmark-techno.com/ Link: https://jira.mariadb.org/browse/MDEV-27900?focusedCommentId=216582&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-216582 Tested-by: Dominique MARTINET <dominique.martinet@atmark-techno.com> CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-08ovl: turn of SB_POSIXACL with idmapped layers temporarilyChristian Brauner
This cycle we added support for mounting overlayfs on top of idmapped mounts. Recently I've started looking into potential corner cases when trying to add additional tests and I noticed that reporting for POSIX ACLs is currently wrong when using idmapped layers with overlayfs mounted on top of it. I have sent out an patch that fixes this and makes POSIX ACLs work correctly but the patch is a bit bigger and we're already at -rc5 so I recommend we simply don't raise SB_POSIXACL when idmapped layers are used. Then we can fix the VFS part described below for the next merge window so we can have good exposure in -next. I'm going to give a rather detailed explanation to both the origin of the problem and mention the solution so people know what's going on. Let's assume the user creates the following directory layout and they have a rootfs /var/lib/lxc/c1/rootfs. The files in this rootfs are owned as you would expect files on your host system to be owned. For example, ~/.bashrc for your regular user would be owned by 1000:1000 and /root/.bashrc would be owned by 0:0. IOW, this is just regular boring filesystem tree on an ext4 or xfs filesystem. The user chooses to set POSIX ACLs using the setfacl binary granting the user with uid 4 read, write, and execute permissions for their .bashrc file: setfacl -m u:4:rwx /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc Now they to expose the whole rootfs to a container using an idmapped mount. So they first create: mkdir -pv /vol/contpool/{ctrover,merge,lowermap,overmap} mkdir -pv /vol/contpool/ctrover/{over,work} chown 10000000:10000000 /vol/contpool/ctrover/{over,work} The user now creates an idmapped mount for the rootfs: mount-idmapped/mount-idmapped --map-mount=b:0:10000000:65536 \ /var/lib/lxc/c2/rootfs \ /vol/contpool/lowermap This for example makes it so that /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc which is owned by uid and gid 1000 as being owned by uid and gid 10001000 at /vol/contpool/lowermap/home/ubuntu/.bashrc. Assume the user wants to expose these idmapped mounts through an overlayfs mount to a container. mount -t overlay overlay \ -o lowerdir=/vol/contpool/lowermap, \ upperdir=/vol/contpool/overmap/over, \ workdir=/vol/contpool/overmap/work \ /vol/contpool/merge The user can do this in two ways: (1) Mount overlayfs in the initial user namespace and expose it to the container. (2) Mount overlayfs on top of the idmapped mounts inside of the container's user namespace. Let's assume the user chooses the (1) option and mounts overlayfs on the host and then changes into a container which uses the idmapping 0:10000000:65536 which is the same used for the two idmapped mounts. Now the user tries to retrieve the POSIX ACLs using the getfacl command getfacl -n /vol/contpool/lowermap/home/ubuntu/.bashrc and to their surprise they see: # file: vol/contpool/merge/home/ubuntu/.bashrc # owner: 1000 # group: 1000 user::rw- user:4294967295:rwx group::r-- mask::rwx other::r-- indicating the uid wasn't correctly translated according to the idmapped mount. The problem is how we currently translate POSIX ACLs. Let's inspect the callchain in this example: idmapped mount /vol/contpool/merge: 0:10000000:65536 caller's idmapping: 0:10000000:65536 overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */ sys_getxattr() -> path_getxattr() -> getxattr() -> do_getxattr() |> vfs_getxattr() | -> __vfs_getxattr() | -> handler->get == ovl_posix_acl_xattr_get() | -> ovl_xattr_get() | -> vfs_getxattr() | -> __vfs_getxattr() | -> handler->get() /* lower filesystem callback */ |> posix_acl_fix_xattr_to_user() { 4 = make_kuid(&init_user_ns, 4); 4 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 4); /* FAILURE */ -1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4); } If the user chooses to use option (2) and mounts overlayfs on top of idmapped mounts inside the container things don't look that much better: idmapped mount /vol/contpool/merge: 0:10000000:65536 caller's idmapping: 0:10000000:65536 overlayfs idmapping (ofs->creator_cred): 0:10000000:65536 sys_getxattr() -> path_getxattr() -> getxattr() -> do_getxattr() |> vfs_getxattr() | -> __vfs_getxattr() | -> handler->get == ovl_posix_acl_xattr_get() | -> ovl_xattr_get() | -> vfs_getxattr() | -> __vfs_getxattr() | -> handler->get() /* lower filesystem callback */ |> posix_acl_fix_xattr_to_user() { 4 = make_kuid(&init_user_ns, 4); 4 = mapped_kuid_fs(&init_user_ns, 4); /* FAILURE */ -1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4); } As is easily seen the problem arises because the idmapping of the lower mount isn't taken into account as all of this happens in do_gexattr(). But do_getxattr() is always called on an overlayfs mount and inode and thus cannot possible take the idmapping of the lower layers into account. This problem is similar for fscaps but there the translation happens as part of vfs_getxattr() already. Let's walk through an fscaps overlayfs callchain: setcap 'cap_net_raw+ep' /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc The expected outcome here is that we'll receive the cap_net_raw capability as we are able to map the uid associated with the fscap to 0 within our container. IOW, we want to see 0 as the result of the idmapping translations. If the user chooses option (1) we get the following callchain for fscaps: idmapped mount /vol/contpool/merge: 0:10000000:65536 caller's idmapping: 0:10000000:65536 overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */ sys_getxattr() -> path_getxattr() -> getxattr() -> do_getxattr() -> vfs_getxattr() -> xattr_getsecurity() -> security_inode_getsecurity() ________________________________ -> cap_inode_getsecurity() | | { V | 10000000 = make_kuid(0:0:4k /* overlayfs idmapping */, 10000000); | 10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000); | /* Expected result is 0 and thus that we own the fscap. */ | 0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000); | } | -> vfs_getxattr_alloc() | -> handler->get == ovl_other_xattr_get() | -> vfs_getxattr() | -> xattr_getsecurity() | -> security_inode_getsecurity() | -> cap_inode_getsecurity() | { | 0 = make_kuid(0:0:4k /* lower s_user_ns */, 0); | 10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0); | 10000000 = from_kuid(0:0:4k /* overlayfs idmapping */, 10000000); | |____________________________________________________________________| } -> vfs_getxattr_alloc() -> handler->get == /* lower filesystem callback */ And if the user chooses option (2) we get: idmapped mount /vol/contpool/merge: 0:10000000:65536 caller's idmapping: 0:10000000:65536 overlayfs idmapping (ofs->creator_cred): 0:10000000:65536 sys_getxattr() -> path_getxattr() -> getxattr() -> do_getxattr() -> vfs_getxattr() -> xattr_getsecurity() -> security_inode_getsecurity() _______________________________ -> cap_inode_getsecurity() | | { V | 10000000 = make_kuid(0:10000000:65536 /* overlayfs idmapping */, 0); | 10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000); | /* Expected result is 0 and thus that we own the fscap. */ | 0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000); | } | -> vfs_getxattr_alloc() | -> handler->get == ovl_other_xattr_get() | |-> vfs_getxattr() | -> xattr_getsecurity() | -> security_inode_getsecurity() | -> cap_inode_getsecurity() | { | 0 = make_kuid(0:0:4k /* lower s_user_ns */, 0); | 10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0); | 0 = from_kuid(0:10000000:65536 /* overlayfs idmapping */, 10000000); | |____________________________________________________________________| } -> vfs_getxattr_alloc() -> handler->get == /* lower filesystem callback */ We can see how the translation happens correctly in those cases as the conversion happens within the vfs_getxattr() helper. For POSIX ACLs we need to do something similar. However, in contrast to fscaps we cannot apply the fix directly to the kernel internal posix acl data structure as this would alter the cached values and would also require a rework of how we currently deal with POSIX ACLs in general which almost never take the filesystem idmapping into account (the noteable exception being FUSE but even there the implementation is special) and instead retrieve the raw values based on the initial idmapping. The correct values are then generated right before returning to userspace. The fix for this is to move taking the mount's idmapping into account directly in vfs_getxattr() instead of having it be part of posix_acl_fix_xattr_to_user(). To this end we simply move the idmapped mount translation into a separate step performed in vfs_{g,s}etxattr() instead of in posix_acl_fix_xattr_{from,to}_user(). To see how this fixes things let's go back to the original example. Assume the user chose option (1) and mounted overlayfs on top of idmapped mounts on the host: idmapped mount /vol/contpool/merge: 0:10000000:65536 caller's idmapping: 0:10000000:65536 overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */ sys_getxattr() -> path_getxattr() -> getxattr() -> do_getxattr() |> vfs_getxattr() | |> __vfs_getxattr() | | -> handler->get == ovl_posix_acl_xattr_get() | | -> ovl_xattr_get() | | -> vfs_getxattr() | | |> __vfs_getxattr() | | | -> handler->get() /* lower filesystem callback */ | | |> posix_acl_getxattr_idmapped_mnt() | | { | | 4 = make_kuid(&init_user_ns, 4); | | 10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4); | | 10000004 = from_kuid(&init_user_ns, 10000004); | | |_______________________ | | } | | | | | |> posix_acl_getxattr_idmapped_mnt() | | { | | V | 10000004 = make_kuid(&init_user_ns, 10000004); | 10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004); | 10000004 = from_kuid(&init_user_ns, 10000004); | } |_________________________________________________ | | | | |> posix_acl_fix_xattr_to_user() | { V 10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004); /* SUCCESS */ 4 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000004); } And similarly if the user chooses option (1) and mounted overayfs on top of idmapped mounts inside the container: idmapped mount /vol/contpool/merge: 0:10000000:65536 caller's idmapping: 0:10000000:65536 overlayfs idmapping (ofs->creator_cred): 0:10000000:65536 sys_getxattr() -> path_getxattr() -> getxattr() -> do_getxattr() |> vfs_getxattr() | |> __vfs_getxattr() | | -> handler->get == ovl_posix_acl_xattr_get() | | -> ovl_xattr_get() | | -> vfs_getxattr() | | |> __vfs_getxattr() | | | -> handler->get() /* lower filesystem callback */ | | |> posix_acl_getxattr_idmapped_mnt() | | { | | 4 = make_kuid(&init_user_ns, 4); | | 10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4); | | 10000004 = from_kuid(&init_user_ns, 10000004); | | |_______________________ | | } | | | | | |> posix_acl_getxattr_idmapped_mnt() | | { V | 10000004 = make_kuid(&init_user_ns, 10000004); | 10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004); | 10000004 = from_kuid(0(&init_user_ns, 10000004); | |_________________________________________________ | } | | | |> posix_acl_fix_xattr_to_user() | { V 10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004); /* SUCCESS */ 4 = from_kuid(0:10000000:65536 /* caller's idmappings */, 10000004); } The last remaining problem we need to fix here is ovl_get_acl(). During ovl_permission() overlayfs will call: ovl_permission() -> generic_permission() -> acl_permission_check() -> check_acl() -> get_acl() -> inode->i_op->get_acl() == ovl_get_acl() > get_acl() /* on the underlying filesystem) ->inode->i_op->get_acl() == /*lower filesystem callback */ -> posix_acl_permission() passing through the get_acl request to the underlying filesystem. This will retrieve the acls stored in the lower filesystem without taking the idmapping of the underlying mount into account as this would mean altering the cached values for the lower filesystem. The simple solution is to have ovl_get_acl() simply duplicate the ACLs, update the values according to the idmapped mount and return it to acl_permission_check() so it can be used in posix_acl_permission(). Since overlayfs doesn't cache ACLs they'll be released right after. Link: https://github.com/brauner/mount-idmapped/issues/9 Cc: Seth Forshee <sforshee@digitalocean.com> Cc: Amir Goldstein <amir73il@gmail.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: linux-unionfs@vger.kernel.org Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Fixes: bc70682a497c ("ovl: support idmapped layers") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-08Merge branch 'sysctl-data-races'David S. Miller
Kuniyuki Iwashima says: ==================== sysctl: Fix data-races around ipv4_table. A sysctl variable is accessed concurrently, and there is always a chance of data-race. So, all readers and writers need some basic protection to avoid load/store-tearing. The first half of this series changes some proc handlers used in ipv4_table to use READ_ONCE() and WRITE_ONCE() internally to fix data-races on the sysctl side. Then, the second half adds READ_ONCE() to the other readers of ipv4_table. Changes: v2: * Drop some changes that makes backporting difficult * First cleanup patch * Lockless helpers and .proc_handler changes * Drop the tracing part for .sysctl_mem * Steve already posted a fix * Drop int-to-bool change for cipso * Should be posted to net-next later * Drop proc_dobool() change * Can be included in another series v1: https://lore.kernel.org/netdev/20220706052130.16368-1-kuniyu@amazon.com/ ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-08ipv4: Fix a data-race around sysctl_fib_sync_mem.Kuniyuki Iwashima
While reading sysctl_fib_sync_mem, it can be changed concurrently. So, we need to add READ_ONCE() to avoid a data-race. Fixes: 9ab948a91b2c ("ipv4: Allow amount of dirty memory from fib resizing to be controllable") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-08icmp: Fix data-races around sysctl.Kuniyuki Iwashima
While reading icmp sysctl variables, they can be changed concurrently. So, we need to add READ_ONCE() to avoid data-races. Fixes: 4cdf507d5452 ("icmp: add a global rate limitation") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>