summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2023-02-20Merge tag 'fs.acl.v6.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull vfs acl update from Christian Brauner: "This contains a single update to the internal get acl method and replaces an open-coded cmpxchg() comparison with with try_cmpxchg(). It's clearer and also beneficial on some architectures" * tag 'fs.acl.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: posix_acl: Use try_cmpxchg in get_acl
2023-02-20Merge tag 'fs.v6.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull vfs hardening update from Christian Brauner: "Jan pointed out that during shutdown both filp_close() and super block destruction will use basic printk logging when bugs are detected. This causes issues in a few scenarios: - Tools like syzkaller cannot figure out that the logged message indicates a bug. - Users that explicitly opt in to have the kernel bug on data corruption by selecting CONFIG_BUG_ON_DATA_CORRUPTION should see the kernel crash when they did actually select that option. - When there are busy inodes after the superblock is shut down later access to such a busy inodes walks through freed memory. It would be better to cleanly crash instead. All of this can be addressed by using the already existing CHECK_DATA_CORRUPTION() macro in these places when kernel bugs are detected. Its logging improvement is useful for all users. Otherwise this only has a meaningful behavioral effect when users do select CONFIG_BUG_ON_DATA_CORRUPTION which means this is backward compatible for regular users" * tag 'fs.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: fs: Use CHECK_DATA_CORRUPTION() when kernel bugs are detected
2023-02-20Merge tag 'fs.idmapped.v6.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull vfs idmapping updates from Christian Brauner: - Last cycle we introduced the dedicated struct mnt_idmap type for mount idmapping and the required infrastucture in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). As promised in last cycle's pull request message this converts everything to rely on struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevant on the mount level. Especially for non-vfs developers without detailed knowledge in this area this was a potential source for bugs. This finishes the conversion. Instead of passing the plain namespace around this updates all places that currently take a pointer to a mnt_userns with a pointer to struct mnt_idmap. Now that the conversion is done all helpers down to the really low-level helpers only accept a struct mnt_idmap argument instead of two namespace arguments. Conflating mount and other idmappings will now cause the compiler to complain loudly thus eliminating the possibility of any bugs. This makes it impossible for filesystem developers to mix up mount and filesystem idmappings as they are two distinct types and require distinct helpers that cannot be used interchangeably. Everything associated with struct mnt_idmap is moved into a single separate file. With that change no code can poke around in struct mnt_idmap. It can only be interacted with through dedicated helpers. That means all filesystems are and all of the vfs is completely oblivious to the actual implementation of idmappings. We are now also able to extend struct mnt_idmap as we see fit. For example, we can decouple it completely from namespaces for users that don't require or don't want to use them at all. We can also extend the concept of idmappings so we can cover filesystem specific requirements. In combination with the vfs{g,u}id_t work we finished in v6.2 this makes this feature substantially more robust and thus difficult to implement wrong by a given filesystem and also protects the vfs. - Enable idmapped mounts for tmpfs and fulfill a longstanding request. A long-standing request from users had been to make it possible to create idmapped mounts for tmpfs. For example, to share the host's tmpfs mount between multiple sandboxes. This is a prerequisite for some advanced Kubernetes cases. Systemd also has a range of use-cases to increase service isolation. And there are more users of this. However, with all of the other work going on this was way down on the priority list but luckily someone other than ourselves picked this up. As usual the patch is tiny as all the infrastructure work had been done multiple kernel releases ago. In addition to all the tests that we already have I requested that Rodrigo add a dedicated tmpfs testsuite for idmapped mounts to xfstests. It is to be included into xfstests during the v6.3 development cycle. This should add a slew of additional tests. * tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (26 commits) shmem: support idmapped mounts for tmpfs fs: move mnt_idmap fs: port vfs{g,u}id helpers to mnt_idmap fs: port fs{g,u}id helpers to mnt_idmap fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap fs: port i_{g,u}id_{needs_}update() to mnt_idmap quota: port to mnt_idmap fs: port privilege checking helpers to mnt_idmap fs: port inode_owner_or_capable() to mnt_idmap fs: port inode_init_owner() to mnt_idmap fs: port acl to mnt_idmap fs: port xattr to mnt_idmap fs: port ->permission() to pass mnt_idmap fs: port ->fileattr_set() to pass mnt_idmap fs: port ->set_acl() to pass mnt_idmap fs: port ->get_acl() to pass mnt_idmap fs: port ->tmpfile() to pass mnt_idmap fs: port ->rename() to pass mnt_idmap fs: port ->mknod() to pass mnt_idmap fs: port ->mkdir() to pass mnt_idmap ...
2023-02-20sched/topology: fix KASAN warning in hop_cmp()Yury Norov
Despite that prev_hop is used conditionally on cur_hop is not the first hop, it's initialized unconditionally. Because initialization implies dereferencing, it might happen that the code dereferences uninitialized memory, which has been spotted by KASAN. Fix it by reorganizing hop_cmp() logic. Reported-by: Bruno Goncalves <bgoncalv@redhat.com> Fixes: cd7f55359c90 ("sched: add sched_numa_find_nth_cpu()") Signed-off-by: Yury Norov <yury.norov@gmail.com> Link: https://lore.kernel.org/r/Y+7avK6V9SyAWsXi@yury-laptop/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-20Merge tag 'iversion-v6.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull i_version updates from Jeff Layton: "This overhauls how we handle i_version queries from nfsd. Instead of having special routines and grabbing the i_version field directly out of the inode in some cases, we've moved most of the handling into the various filesystems' getattr operations. As a bonus, this makes ceph's change attribute usable by knfsd as well. This should pave the way for future work to make this value queryable by userland, and to make it more resilient against rolling back on a crash" * tag 'iversion-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: nfsd: remove fetch_iversion export operation nfsd: use the getattr operation to fetch i_version nfsd: move nfsd4_change_attribute to nfsfh.c ceph: report the inode version in getattr if requested nfs: report the inode version in getattr if requested vfs: plumb i_version handling into struct kstat fs: clarify when the i_version counter must be updated fs: uninline inode_query_iversion
2023-02-20Merge tag 'locks-v6.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull file locking updates from Jeff Layton: "The main change here is that I've broken out most of the file locking definitions into a new header file. I also went ahead and completed the removal of locks_inode function" * tag 'locks-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: fs: remove locks_inode filelock: move file locking definitions to separate header file
2023-02-20Merge tag 'tpm-v6.3-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd Pull tpm updates from Jarkko Sakkinen: "In additon to bug fixes, these are noteworthy changes: - In TPM I2C drivers, migrate from probe() to probe_new() (a new driver model in I2C). - TPM CRB: Pluton support - Add duplicate hash detection to the blacklist keyring in order to give more meaningful klog output than e.g. [1]" Link: https://askubuntu.com/questions/1436856/ubuntu-22-10-blacklist-problem-blacklisting-hash-13-message-on-boot [1] * tag 'tpm-v6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd: tpm: add vendor flag to command code validation tpm: Add reserved memory event log tpm: Use managed allocation for bios event log tpm: tis_i2c: Convert to i2c's .probe_new() tpm: tpm_i2c_nuvoton: Convert to i2c's .probe_new() tpm: tpm_i2c_infineon: Convert to i2c's .probe_new() tpm: tpm_i2c_atmel: Convert to i2c's .probe_new() tpm: st33zp24: Convert to i2c's .probe_new() KEYS: asymmetric: Fix ECDSA use via keyctl uapi certs: don't try to update blacklist keys KEYS: Add new function key_create() certs: make blacklisted hash available in klog tpm_crb: Add support for CRB devices based on Pluton crypto: certs: fix FIPS selftest dependency
2023-02-20Merge tag 'rust-6.3' of https://github.com/Rust-for-Linux/linuxLinus Torvalds
Pull Rust updates from Miguel Ojeda: "More core additions, getting closer to a point where the first Rust modules can be upstreamed. The major ones being: - Sync: new types 'Arc', 'ArcBorrow' and 'UniqueArc'. - Types: new trait 'ForeignOwnable' and new type 'ScopeGuard'. There is also a substantial removal in terms of lines: - 'alloc' crate: remove the 'borrow' module (type 'Cow' and trait 'ToOwned')" * tag 'rust-6.3' of https://github.com/Rust-for-Linux/linux: rust: delete rust-project.json when running make clean rust: MAINTAINERS: Add the zulip link rust: types: implement `ForeignOwnable` for `Arc<T>` rust: types: implement `ForeignOwnable` for the unit type rust: types: implement `ForeignOwnable` for `Box<T>` rust: types: introduce `ForeignOwnable` rust: types: introduce `ScopeGuard` rust: prelude: prevent doc inline of external imports rust: sync: add support for dispatching on Arc and ArcBorrow. rust: sync: introduce `UniqueArc` rust: sync: allow type of `self` to be `ArcBorrow<T>` rust: sync: introduce `ArcBorrow` rust: sync: allow coercion from `Arc<T>` to `Arc<U>` rust: sync: allow type of `self` to be `Arc<T>` or variants rust: sync: add `Arc` for ref-counted allocations rust: compiler_builtins: make stubs non-global rust: alloc: remove the `borrow` module (`ToOwned`, `Cow`)
2023-02-20arm64: fix .idmap.text assertion for large kernelsMark Rutland
When building a kernel with many debug options enabled (which happens in test configurations use by myself and syzbot), the kernel can become large enough that portions of .text can be more than 128M away from .idmap.text (which is placed inside the .rodata section). Where idmap code branches into .text, the linker will place veneers in the .idmap.text section to make those branches possible. Unfortunately, as Ard reports, GNU LD has bseen observed to add 4K of padding when adding such veneers, e.g. | .idmap.text 0xffffffc01e48e5c0 0x32c arch/arm64/mm/proc.o | 0xffffffc01e48e5c0 idmap_cpu_replace_ttbr1 | 0xffffffc01e48e600 idmap_kpti_install_ng_mappings | 0xffffffc01e48e800 __cpu_setup | *fill* 0xffffffc01e48e8ec 0x4 | .idmap.text.stub | 0xffffffc01e48e8f0 0x18 linker stubs | 0xffffffc01e48f8f0 __idmap_text_end = . | 0xffffffc01e48f000 . = ALIGN (0x1000) | *fill* 0xffffffc01e48f8f0 0x710 | 0xffffffc01e490000 idmap_pg_dir = . This makes the __idmap_text_start .. __idmap_text_end region bigger than the 4K we require it to fit within, and triggers an assertion in arm64's vmlinux.lds.S, which breaks the build: | LD .tmp_vmlinux.kallsyms1 | aarch64-linux-gnu-ld: ID map text too big or misaligned | make[1]: *** [scripts/Makefile.vmlinux:35: vmlinux] Error 1 | make: *** [Makefile:1264: vmlinux] Error 2 Avoid this by using an `ADRP+ADD+BLR` sequence for branches out of .idmap.text, which avoids the need for veneers. These branches are only executed once per boot, and only when the MMU is on, so there should be no noticeable performance penalty in replacing `BL` with `ADRP+ADD+BLR`. At the same time, remove the "x" and "w" attributes when placing code in .idmap.text, as these are not necessary, and this will prevent the linker from assuming that it is safe to place PLTs into .idmap.text, causing it to warn if and when there are out-of-range branches within .idmap.text, e.g. | LD .tmp_vmlinux.kallsyms1 | arch/arm64/kernel/head.o: in function `primary_entry': | (.idmap.text+0x1c): relocation truncated to fit: R_AARCH64_CALL26 against symbol `dcache_clean_poc' defined in .text section in arch/arm64/mm/cache.o | arch/arm64/kernel/head.o: in function `init_el2': | (.idmap.text+0x88): relocation truncated to fit: R_AARCH64_CALL26 against symbol `dcache_clean_poc' defined in .text section in arch/arm64/mm/cache.o | make[1]: *** [scripts/Makefile.vmlinux:34: vmlinux] Error 1 | make: *** [Makefile:1252: vmlinux] Error 2 Thus, if future changes add out-of-range branches in .idmap.text, it should be easy enough to identify those from the resulting linker errors. Reported-by: syzbot+f8ac312e31226e23302b@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-arm-kernel/00000000000028ea4105f4e2ef54@google.com/ Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Will Deacon <will@kernel.org> Tested-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20230220162317.1581208-1-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2023-02-20Merge tag 'remove-get_kernel_pages-for-6.3' of ↵Linus Torvalds
https://git.linaro.org/people/jens.wiklander/linux-tee Pull TEE update from Jens Wiklander: "Remove get_kernel_pages() Vmalloc page support is removed from shm_get_kernel_pages() and the get_kernel_pages() call is replaced by calls to get_page(). With no remaining callers of get_kernel_pages() the function is removed" [ This looks like it's just some random 'tee' cleanup, but the bigger picture impetus for this is really to to to remove historical confusion with mixed use of kernel virtual addresses and 'struct page' pointers. Kernel virtual pointers in the vmalloc space is then particularly confusing - both for looking up a page pointer (when trying to then unify a "virtual address or page" interface) and _particularly_ when mixed with HIGHMEM support and the kmap*() family of remapping. This is particularly true with HIGHMEM getting much less test coverage with 32-bit architectures being increasingly legacy targets. So we actively wanted to remove get_kernel_pages() to make sure nobody else used it too, and thus the 'tee' part is "finally remove last user". See also commit 6647e76ab623 ("v4l2: don't fall back to follow_pfn() if pin_user_pages_fast() fails") for a totally different version of a conceptually similar "let's stop this confusion of different ways of referring to memory". - Linus ] * tag 'remove-get_kernel_pages-for-6.3' of https://git.linaro.org/people/jens.wiklander/linux-tee: mm: Remove get_kernel_pages() tee: Remove call to get_kernel_pages() tee: Remove vmalloc page support highmem: Enhance is_kmap_addr() to check kmap_local_page() mappings
2023-02-20net: bcmgenet: Support wake-up from s2idleFlorian Fainelli
When we suspend into s2idle we also need to enable the interrupt line that generates the MPD and HFB interrupts towards the host CPU interrupt controller (typically the ARM GIC or MIPS L1) to make it exit s2idle. When we suspend into other modes such as "standby" or "mem" we engage a power management state machine which will gate off the CPU L1 controller (priv->irq0) and ungate the side band wake-up interrupt (priv->wol_irq). It is safe to have both enabled as wake-up sources because they are mutually exclusive given any suspend mode. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20scm: add user copy checks to put_cmsg()Eric Dumazet
This is a followup of commit 2558b8039d05 ("net: use a bounce buffer for copying skb->mark") x86 and powerpc define user_access_begin, meaning that they are not able to perform user copy checks when using user_write_access_begin() / unsafe_copy_to_user() and friends [1] Instead of waiting bugs to trigger on other arches, add a check_object_size() in put_cmsg() to make sure that new code tested on x86 with CONFIG_HARDENED_USERCOPY=y will perform more security checks. [1] We can not generically call check_object_size() from unsafe_copy_to_user() because UACCESS is enabled at this point. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kees Cook <keescook@chromium.org> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20devlink: drop leftover duplicate/unused codePaolo Abeni
The recent merge from net left-over some unused code in leftover.c - nomen omen. Just drop the unused bits. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20Merge tag 'linux-can-next-for-6.3-20230217' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next Marc Kleine-Budde says: ==================== pull-request: can-next 2023-02-17 - fixed this is a pull request of 4 patches for net-next/master. The first patch is by Yang Li and converts the ctucanfd driver to devm_platform_ioremap_resource(). The last 3 patches are by Frank Jungclaus, target the esd_usb driver and contains preparations for the upcoming support of the esd CAN-USB/3 hardware. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20net: lan966x: Use automatic selection of VCAP rule actionsetHoratiu Vultur
Since commit 81e164c4aec5 ("net: microchip: sparx5: Add automatic selection of VCAP rule actionset") the VCAP API has the capability to select automatically the actionset based on the actions that are attached to the rule. So it is not needed anymore to hardcode the actionset in the driver, therefore it is OK to remove this. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20Merge branch 'default_rps_mask-follow-up'David S. Miller
Paolo Abeni says: ==================== net: default_rps_mask follow-up The first patch namespacify the setting. In the common case, once proper isolation is in place in the main namespace, forwarding to/from each child netns will allways happen on the desidered CPUs. Any additional RPS stage inside the child namespace will not provide additional isolation and could hurt performance badly if picking a CPU on a remote node. The 2nd patch adds more self-tests coverage. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20self-tests: more rps self testsPaolo Abeni
Explicitly check for child netns and main ns independency Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20net: make default_rps_mask a per netns attributePaolo Abeni
That really was meant to be a per netns attribute from the beginning. The idea is that once proper isolation is in place in the main namespace, additional demux in the child namespaces will be redundant. Let's make child netns default rps mask empty by default. To avoid bloating the netns with a possibly large cpumask, allocate it on-demand during the first write operation. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20Merge tag 'wireless-next-2023-02-17' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Kalle Valo says: ==================== wireless-next patches for v6.3 Third set of patches for v6.3. This time only a set of small fixes submitted during the last day or two. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter updates for net-next: 1) Add safeguard to check for NULL tupe in objects updates via NFT_MSG_NEWOBJ, this should not ever happen. From Alok Tiwari. 2) Incorrect pointer check in the new destroy rule command, from Yang Yingliang. 3) Incorrect status bitcheck in nf_conntrack_udp_packet(), from Florian Westphal. 4) Simplify seq_print_acct(), from Ilia Gavrilov. 5) Use 2-arg optimal variant of kfree_rcu() in IPVS, from Julian Anastasov. 6) TCP connection enters CLOSE state in conntrack for locally originated TCP reset packet from the reject target, from Florian Westphal. The fixes #2 and #3 in this series address issues from the previous pull nf-next request in this net-next cycle. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20net: microchip: sparx5: reduce stack usageArnd Bergmann
The vcap_admin structures in vcap_api_next_lookup_advanced_test() take several hundred bytes of stack frame, but when CONFIG_KASAN_STACK is enabled, each one of them also has extra padding before and after it, which ends up blowing the warning limit: In file included from drivers/net/ethernet/microchip/vcap/vcap_api.c:3521: drivers/net/ethernet/microchip/vcap/vcap_api_kunit.c: In function 'vcap_api_next_lookup_advanced_test': drivers/net/ethernet/microchip/vcap/vcap_api_kunit.c:1954:1: error: the frame size of 1448 bytes is larger than 1400 bytes [-Werror=frame-larger-than=] 1954 | } Reduce the total stack usage by replacing the five structures with an array that only needs one pair of padding areas. Fixes: 1f741f001160 ("net: microchip: sparx5: Add KUNIT tests for enabling/disabling chains") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20sfc: use IS_ENABLED() checks for CONFIG_SFC_SRIOVArnd Bergmann
One local variable has become unused after a recent change: drivers/net/ethernet/sfc/ef100_nic.c: In function 'ef100_probe_netdev_pf': drivers/net/ethernet/sfc/ef100_nic.c:1155:21: error: unused variable 'net_dev' [-Werror=unused-variable] struct net_device *net_dev = efx->net_dev; ^~~~~~~ The variable is still used in an #ifdef. Replace the #ifdef with an if(IS_ENABLED()) check that lets the compiler see where it is used, rather than adding another #ifdef. This also fixes an uninitialized return value in ef100_probe_netdev_pf() that gcc did not spot. Fixes: 7e056e2360d9 ("sfc: obtain device mac address based on firmware handle for ef100") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20ice: properly alloc ICE_VSI_LBMichal Swiatkowski
Devlink reload patchset introduced regression. ICE_VSI_LB wasn't taken into account when doing default allocation. Fix it by adding a case for ICE_VSI_LB in ice_vsi_alloc_def(). Fixes: 6624e780a577 ("ice: split ice_vsi_setup into smaller functions") Reported-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20sfc: Fix spelling mistake "creationg" -> "creating"Colin Ian King
There is a spelling mistake in a pci_warn message. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by:  Alejandro Lucero <alejandro.lucero-palau@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20octeontx2-af: Add NIX Errata workaround on CN10K siliconGeetha sowjanya
This patch adds workaround for below 2 HW erratas 1. Due to improper clock gating, NIXRX may free the same NPA buffer multiple times.. to avoid this, always enable NIX RX conditional clock. 2. NIX FIFO does not get initialized on reset, if the SMQ flush is triggered before the first packet is processed, it will lead to undefined state. The workaround to perform SMQ flush only if packet count is non-zero in MDQ. Signed-off-by: Geetha sowjanya <gakula@marvell.com> Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com> Signed-off-by: Sai Krishna <saikrishnag@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20net: phy: Read EEE abilities when using .featuresAndrew Lunn
A PHY driver can use a static integer value to indicate what link mode features it supports, i.e, its abilities.. This is the old way, but useful when dynamically determining the devices features does not work, e.g. support of fibre. EEE support has been moved into phydev->supported_eee. This needs to be set otherwise the code assumes EEE is not supported. It is normally set as part of reading the devices abilities. However if a static integer value was used, the dynamic reading of the abilities is not performed. Add a call to genphy_c45_read_eee_abilities() to read the EEE abilities. Fixes: 8b68710a3121 ("net: phy: start using genphy_c45_ethtool_get/set_eee()") Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20Merge branch 'phydev-locks'David S. Miller
Andrew Lunn says: ==================== Add additional phydev locks The phydev lock should be held when accessing members of phydev, or calling into the driver. Some of the phy_ethtool_ functions are missing locks. Add them. To avoid deadlock the marvell driver is modified since it calls one of the functions which gain locks, which would result in a deadlock. The missing locks have not caused noticeable issues, so these patches are for net-next. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20net: phy: Add locks to ethtool functionsAndrew Lunn
The phydev lock should be held while accessing members of phydev, or calling into the driver. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20net: phy: marvell: Use the unlocked genphy_c45_ethtool_get_eee()Andrew Lunn
phy_ethtool_get_eee() is about to gain locking of the phydev lock. This means it cannot be used within a PHY driver without causing a deadlock. Swap to using genphy_c45_ethtool_get_eee() which assumes the lock has already been taken. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20net: bcmgenet: fix MoCA LED controlDoug Berger
When the bcmgenet_mii_config() code was refactored it was missed that the LED control for the MoCA interface got overwritten by the port_ctrl value. Its previous programming is restored here. Fixes: 4f8d81b77e66 ("net: bcmgenet: Refactor register access in bcmgenet_mii_config") Signed-off-by: Doug Berger <opendmb@gmail.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20l2tp: Avoid possible recursive deadlock in l2tp_tunnel_register()Shigeru Yoshida
When a file descriptor of pppol2tp socket is passed as file descriptor of UDP socket, a recursive deadlock occurs in l2tp_tunnel_register(). This situation is reproduced by the following program: int main(void) { int sock; struct sockaddr_pppol2tp addr; sock = socket(AF_PPPOX, SOCK_DGRAM, PX_PROTO_OL2TP); if (sock < 0) { perror("socket"); return 1; } addr.sa_family = AF_PPPOX; addr.sa_protocol = PX_PROTO_OL2TP; addr.pppol2tp.pid = 0; addr.pppol2tp.fd = sock; addr.pppol2tp.addr.sin_family = PF_INET; addr.pppol2tp.addr.sin_port = htons(0); addr.pppol2tp.addr.sin_addr.s_addr = inet_addr("192.168.0.1"); addr.pppol2tp.s_tunnel = 1; addr.pppol2tp.s_session = 0; addr.pppol2tp.d_tunnel = 0; addr.pppol2tp.d_session = 0; if (connect(sock, (const struct sockaddr *)&addr, sizeof(addr)) < 0) { perror("connect"); return 1; } return 0; } This program causes the following lockdep warning: ============================================ WARNING: possible recursive locking detected 6.2.0-rc5-00205-gc96618275234 #56 Not tainted -------------------------------------------- repro/8607 is trying to acquire lock: ffff8880213c8130 (sk_lock-AF_PPPOX){+.+.}-{0:0}, at: l2tp_tunnel_register+0x2b7/0x11c0 but task is already holding lock: ffff8880213c8130 (sk_lock-AF_PPPOX){+.+.}-{0:0}, at: pppol2tp_connect+0xa82/0x1a30 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(sk_lock-AF_PPPOX); lock(sk_lock-AF_PPPOX); *** DEADLOCK *** May be due to missing lock nesting notation 1 lock held by repro/8607: #0: ffff8880213c8130 (sk_lock-AF_PPPOX){+.+.}-{0:0}, at: pppol2tp_connect+0xa82/0x1a30 stack backtrace: CPU: 0 PID: 8607 Comm: repro Not tainted 6.2.0-rc5-00205-gc96618275234 #56 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x100/0x178 __lock_acquire.cold+0x119/0x3b9 ? lockdep_hardirqs_on_prepare+0x410/0x410 lock_acquire+0x1e0/0x610 ? l2tp_tunnel_register+0x2b7/0x11c0 ? lock_downgrade+0x710/0x710 ? __fget_files+0x283/0x3e0 lock_sock_nested+0x3a/0xf0 ? l2tp_tunnel_register+0x2b7/0x11c0 l2tp_tunnel_register+0x2b7/0x11c0 ? sprintf+0xc4/0x100 ? l2tp_tunnel_del_work+0x6b0/0x6b0 ? debug_object_deactivate+0x320/0x320 ? lockdep_init_map_type+0x16d/0x7a0 ? lockdep_init_map_type+0x16d/0x7a0 ? l2tp_tunnel_create+0x2bf/0x4b0 ? l2tp_tunnel_create+0x3c6/0x4b0 pppol2tp_connect+0x14e1/0x1a30 ? pppol2tp_put_sk+0xd0/0xd0 ? aa_sk_perm+0x2b7/0xa80 ? aa_af_perm+0x260/0x260 ? bpf_lsm_socket_connect+0x9/0x10 ? pppol2tp_put_sk+0xd0/0xd0 __sys_connect_file+0x14f/0x190 __sys_connect+0x133/0x160 ? __sys_connect_file+0x190/0x190 ? lockdep_hardirqs_on+0x7d/0x100 ? ktime_get_coarse_real_ts64+0x1b7/0x200 ? ktime_get_coarse_real_ts64+0x147/0x200 ? __audit_syscall_entry+0x396/0x500 __x64_sys_connect+0x72/0xb0 do_syscall_64+0x38/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd This patch fixes the issue by getting/creating the tunnel before locking the pppol2tp socket. Fixes: 0b2c59720e65 ("l2tp: close all race conditions in l2tp_tunnel_register()") Cc: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Shigeru Yoshida <syoshida@redhat.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20Merge branch 'icmp6-drop-reason'David S. Miller
Eric Dumazet says: ==================== ipv6: icmp6: better drop reason support This series aims to have more precise drop reason reports for icmp6. This should reduce false positives on most usual cases. This can be extended as needed later. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20ipv6: icmp6: add drop reason support to icmpv6_echo_reply()Eric Dumazet
Change icmpv6_echo_reply() to return a drop reason. For the moment, return NOT_SPECIFIED or SKB_CONSUMED. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20ipv6: icmp6: add SKB_DROP_REASON_IPV6_NDISC_NS_OTHERHOSTEric Dumazet
Hosts can often receive neighbour discovery messages that are not for them. Use a dedicated drop reason to make clear the packet is dropped for this normal case. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20ipv6: icmp6: add SKB_DROP_REASON_IPV6_NDISC_BAD_OPTIONSEric Dumazet
This is a generic drop reason for any error detected in ndisc_parse_options(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20ipv6: icmp6: add drop reason support to ndisc_redirect_rcv()Eric Dumazet
Change ndisc_redirect_rcv() to return a drop reason. For the moment, return PKT_TOO_SMALL, NOT_SPECIFIED and values from icmpv6_notify(). More reasons are added later. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20ipv6: icmp6: add drop reason support to ndisc_router_discovery()Eric Dumazet
Change ndisc_router_discovery() to return a drop reason. For the moment, return PKT_TOO_SMALL, NOT_SPECIFIED and SKB_CONSUMED. More reasons are added later. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20ipv6: icmp6: add drop reason support to ndisc_recv_rs()Eric Dumazet
Change ndisc_recv_rs() to return a drop reason. For the moment, return PKT_TOO_SMALL, NOT_SPECIFIED or SKB_CONSUMED. More reasons are added later. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20ipv6: icmp6: add drop reason support to ndisc_recv_na()Eric Dumazet
Change ndisc_recv_na() to return a drop reason. For the moment, return PKT_TOO_SMALL, NOT_SPECIFIED or SKB_CONSUMED. More reasons are added later. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20ipv6: icmp6: add drop reason support to ndisc_recv_ns()Eric Dumazet
Change ndisc_recv_ns() to return a drop reason. For the moment, return PKT_TOO_SMALL, NOT_SPECIFIED or SKB_CONSUMED. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20net: add location to trace_consume_skb()Eric Dumazet
kfree_skb() includes the location, it makes sense to add it to consume_skb() as well. After patch: taskd_EventMana 8602 [004] 420.406239: skb:consume_skb: skbaddr=0xffff893a4a6d0500 location=unix_stream_read_generic swapper 0 [011] 422.732607: skb:consume_skb: skbaddr=0xffff89597f68cee0 location=mlx4_en_free_tx_desc discipline 9141 [043] 423.065653: skb:consume_skb: skbaddr=0xffff893a487e9c00 location=skb_consume_udp swapper 0 [010] 423.073166: skb:consume_skb: skbaddr=0xffff8949ce9cdb00 location=icmpv6_rcv borglet 8672 [014] 425.628256: skb:consume_skb: skbaddr=0xffff8949c42e9400 location=netlink_dump swapper 0 [028] 426.263317: skb:consume_skb: skbaddr=0xffff893b1589dce0 location=net_rx_action wget 14339 [009] 426.686380: skb:consume_skb: skbaddr=0xffff893a51b552e0 location=tcp_rcv_state_process Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20selftests/net: Interpret UDP_GRO cmsg data as an int valueJakub Sitnicki
Data passed to user-space with a (SOL_UDP, UDP_GRO) cmsg carries an int (see udp_cmsg_recv), not a u16 value, as strace confirms: recvmsg(8, {msg_name=..., msg_iov=[{iov_base="\0\0..."..., iov_len=96000}], msg_iovlen=1, msg_control=[{cmsg_len=20, <-- sizeof(cmsghdr) + 4 cmsg_level=SOL_UDP, cmsg_type=0x68}], <-- UDP_GRO msg_controllen=24, msg_flags=0}, 0) = 11200 Interpreting the data as an u16 value won't work on big-endian platforms. Since it is too late to back out of this API decision [1], fix the test. [1]: https://lore.kernel.org/netdev/20230131174601.203127-1-jakub@cloudflare.com/ Fixes: 3327a9c46352 ("selftests: add functionals test for UDP GRO") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20qede: fix interrupt coalescing configurationManish Chopra
On default driver load device gets configured with unexpected higher interrupt coalescing values instead of default expected values as memory allocated from krealloc() is not supposed to be zeroed out and may contain garbage values. Fix this by allocating the memory of required size first with kcalloc() and then use krealloc() to resize and preserve the contents across down/up of the interface. Signed-off-by: Manish Chopra <manishc@marvell.com> Fixes: b0ec5489c480 ("qede: preserve per queue stats across up/down of interface") Cc: stable@vger.kernel.org Cc: Bhaskar Upadhaya <bupadhaya@marvell.com> Cc: David S. Miller <davem@davemloft.net> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2160054 Signed-off-by: Alok Prasad <palok@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20xsk: support use vaddr as ringXuan Zhuo
When we try to start AF_XDP on some machines with long running time, due to the machine's memory fragmentation problem, there is no sufficient contiguous physical memory that will cause the start failure. If the size of the queue is 8 * 1024, then the size of the desc[] is 8 * 1024 * 8 = 16 * PAGE, but we also add struct xdp_ring size, so it is 16page+. This is necessary to apply for a 4-order memory. If there are a lot of queues, it is difficult to these machine with long running time. Here, that we actually waste 15 pages. 4-Order memory is 32 pages, but we only use 17 pages. This patch replaces __get_free_pages() by vmalloc() to allocate memory to solve these problems. Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20net/smc: fix application data exceptionD. Wythe
There is a certain probability that following exceptions will occur in the wrk benchmark test: Running 10s test @ http://11.213.45.6:80 8 threads and 64 connections Thread Stats Avg Stdev Max +/- Stdev Latency 3.72ms 13.94ms 245.33ms 94.17% Req/Sec 1.96k 713.67 5.41k 75.16% 155262 requests in 10.10s, 23.10MB read Non-2xx or 3xx responses: 3 We will find that the error is HTTP 400 error, which is a serious exception in our test, which means the application data was corrupted. Consider the following scenarios: CPU0 CPU1 buf_desc->used = 0; cmpxchg(buf_desc->used, 0, 1) deal_with(buf_desc) memset(buf_desc->cpu_addr,0); This will cause the data received by a victim connection to be cleared, thus triggering an HTTP 400 error in the server. This patch exchange the order between clear used and memset, add barrier to ensure memory consistency. Fixes: 1c5526968e27 ("net/smc: Clear memory when release and reuse buffer") Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20net/smc: fix potential panic dues to unprotected smc_llc_srv_add_link()D. Wythe
There is a certain chance to trigger the following panic: PID: 5900 TASK: ffff88c1c8af4100 CPU: 1 COMMAND: "kworker/1:48" #0 [ffff9456c1cc79a0] machine_kexec at ffffffff870665b7 #1 [ffff9456c1cc79f0] __crash_kexec at ffffffff871b4c7a #2 [ffff9456c1cc7ab0] crash_kexec at ffffffff871b5b60 #3 [ffff9456c1cc7ac0] oops_end at ffffffff87026ce7 #4 [ffff9456c1cc7ae0] page_fault_oops at ffffffff87075715 #5 [ffff9456c1cc7b58] exc_page_fault at ffffffff87ad0654 #6 [ffff9456c1cc7b80] asm_exc_page_fault at ffffffff87c00b62 [exception RIP: ib_alloc_mr+19] RIP: ffffffffc0c9cce3 RSP: ffff9456c1cc7c38 RFLAGS: 00010202 RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000004 RDX: 0000000000000010 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff88c1ea281d00 R8: 000000020a34ffff R9: ffff88c1350bbb20 R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000 R13: 0000000000000010 R14: ffff88c1ab040a50 R15: ffff88c1ea281d00 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #7 [ffff9456c1cc7c60] smc_ib_get_memory_region at ffffffffc0aff6df [smc] #8 [ffff9456c1cc7c88] smcr_buf_map_link at ffffffffc0b0278c [smc] #9 [ffff9456c1cc7ce0] __smc_buf_create at ffffffffc0b03586 [smc] The reason here is that when the server tries to create a second link, smc_llc_srv_add_link() has no protection and may add a new link to link group. This breaks the security environment protected by llc_conf_mutex. Fixes: 2d2209f20189 ("net/smc: first part of add link processing as SMC server") Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20Merge branch 'taprio-queuemaxsdu-fixes'Paolo Abeni
Vladimir Oltean says: ==================== taprio queueMaxSDU fixes This fixes 3 issues noticed while attempting to reoffload the dynamically calculated queueMaxSDU values. These are: - Dynamic queueMaxSDU is not calculated correctly due to a lost patch - Dynamically calculated queueMaxSDU needs to be clamped on the low end - Dynamically calculated queueMaxSDU needs to be clamped on the high end ==================== Link: https://lore.kernel.org/r/20230215224632.2532685-1-vladimir.oltean@nxp.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-02-20net/sched: taprio: dynamic max_sdu larger than the max_mtu is unlimitedVladimir Oltean
It makes no sense to keep randomly large max_sdu values, especially if larger than the device's max_mtu. These are visible in "tc qdisc show". Such a max_sdu is practically unlimited and will cause no packets for that traffic class to be dropped on enqueue. Just set max_sdu_dynamic to U32_MAX, which in the logic below causes taprio to save a max_frm_len of U32_MAX and a max_sdu presented to user space of 0 (unlimited). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-02-20net/sched: taprio: don't allow dynamic max_sdu to go negative after stab ↵Vladimir Oltean
adjustment The overhead specified in the size table comes from the user. With small time intervals (or gates always closed), the overhead can be larger than the max interval for that traffic class, and their difference is negative. What we want to happen is for max_sdu_dynamic to have the smallest non-zero value possible (1) which means that all packets on that traffic class are dropped on enqueue. However, since max_sdu_dynamic is u32, a negative is represented as a large value and oversized dropping never happens. Use max_t with int to force a truncation of max_frm_len to no smaller than dev->hard_header_len + 1, which in turn makes max_sdu_dynamic no smaller than 1. Fixes: fed87cc6718a ("net/sched: taprio: automatically calculate queueMaxSDU based on TC gate durations") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-02-20net/sched: taprio: fix calculation of maximum gate durationsVladimir Oltean
taprio_calculate_gate_durations() depends on netdev_get_num_tc() and this returns 0. So it calculates the maximum gate durations for no traffic class. I had tested the blamed commit only with another patch in my tree, one which in the end I decided isn't valuable enough to submit ("net/sched: taprio: mask off bits in gate mask that exceed number of TCs"). The problem is that having this patch threw off my testing. By moving the netdev_set_num_tc() call earlier, we implicitly gave to taprio_calculate_gate_durations() the information it needed. Extract only the portion from the unsubmitted change which applies the mqprio configuration to the netdev earlier. Link: https://patchwork.kernel.org/project/netdevbpf/patch/20230130173145.475943-15-vladimir.oltean@nxp.com/ Fixes: a306a90c8ffe ("net/sched: taprio: calculate tc gate durations") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com>