summaryrefslogtreecommitdiff
path: root/include/uapi
AgeCommit message (Collapse)Author
2024-09-21Merge tag 'mm-stable-2024-09-20-02-31' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Along with the usual shower of singleton patches, notable patch series in this pull request are: - "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds consistency to the APIs and behaviour of these two core allocation functions. This also simplifies/enables Rustification. - "Some cleanups for shmem" from Baolin Wang. No functional changes - mode code reuse, better function naming, logic simplifications. - "mm: some small page fault cleanups" from Josef Bacik. No functional changes - code cleanups only. - "Various memory tiering fixes" from Zi Yan. A small fix and a little cleanup. - "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and simplifications and .text shrinkage. - "Kernel stack usage histogram" from Pasha Tatashin and Shakeel Butt. This is a feature, it adds new feilds to /proc/vmstat such as $ grep kstack /proc/vmstat kstack_1k 3 kstack_2k 188 kstack_4k 11391 kstack_8k 243 kstack_16k 0 which tells us that 11391 processes used 4k of stack while none at all used 16k. Useful for some system tuning things, but partivularly useful for "the dynamic kernel stack project". - "kmemleak: support for percpu memory leak detect" from Pavel Tikhomirov. Teaches kmemleak to detect leaksage of percpu memory. - "mm: memcg: page counters optimizations" from Roman Gushchin. "3 independent small optimizations of page counters". - "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from David Hildenbrand. Improves PTE/PMD splitlock detection, makes powerpc/8xx work correctly by design rather than by accident. - "mm: remove arch_make_page_accessible()" from David Hildenbrand. Some folio conversions which make arch_make_page_accessible() unneeded. - "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David Finkel. Cleans up and fixes our handling of the resetting of the cgroup/process peak-memory-use detector. - "Make core VMA operations internal and testable" from Lorenzo Stoakes. Rationalizaion and encapsulation of the VMA manipulation APIs. With a view to better enable testing of the VMA functions, even from a userspace-only harness. - "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix issues in the zswap global shrinker, resulting in improved performance. - "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill in some missing info in /proc/zoneinfo. - "mm: replace follow_page() by folio_walk" from David Hildenbrand. Code cleanups and rationalizations (conversion to folio_walk()) resulting in the removal of follow_page(). - "improving dynamic zswap shrinker protection scheme" from Nhat Pham. Some tuning to improve zswap's dynamic shrinker. Significant reductions in swapin and improvements in performance are shown. - "mm: Fix several issues with unaccepted memory" from Kirill Shutemov. Improvements to the new unaccepted memory feature, - "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on DAX PUDs. This was missing, although nobody seems to have notied yet. - "Introduce a store type enum for the Maple tree" from Sidhartha Kumar. Cleanups and modest performance improvements for the maple tree library code. - "memcg: further decouple v1 code from v2" from Shakeel Butt. Move more cgroup v1 remnants away from the v2 memcg code. - "memcg: initiate deprecation of v1 features" from Shakeel Butt. Adds various warnings telling users that memcg v1 features are deprecated. - "mm: swap: mTHP swap allocator base on swap cluster order" from Chris Li. Greatly improves the success rate of the mTHP swap allocation. - "mm: introduce numa_memblks" from Mike Rapoport. Moves various disparate per-arch implementations of numa_memblk code into generic code. - "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly improves the performance of munmap() of swap-filled ptes. - "support large folio swap-out and swap-in for shmem" from Baolin Wang. With this series we no longer split shmem large folios into simgle-page folios when swapping out shmem. - "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice performance improvements and code reductions for gigantic folios. - "support shmem mTHP collapse" from Baolin Wang. Adds support for khugepaged's collapsing of shmem mTHP folios. - "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect() performance regression due to the addition of mseal(). - "Increase the number of bits available in page_type" from Matthew Wilcox. Increases the number of bits available in page_type! - "Simplify the page flags a little" from Matthew Wilcox. Many legacy page flags are now folio flags, so the page-based flags and their accessors/mutators can be removed. - "mm: store zero pages to be swapped out in a bitmap" from Usama Arif. An optimization which permits us to avoid writing/reading zero-filled zswap pages to backing store. - "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race window which occurs when a MAP_FIXED operqtion is occurring during an unrelated vma tree walk. - "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of the vma_merge() functionality, making ot cleaner, more testable and better tested. - "misc fixups for DAMON {self,kunit} tests" from SeongJae Park. Minor fixups of DAMON selftests and kunit tests. - "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang. Code cleanups and folio conversions. - "Shmem mTHP controls and stats improvements" from Ryan Roberts. Cleanups for shmem controls and stats. - "mm: count the number of anonymous THPs per size" from Barry Song. Expose additional anon THP stats to userspace for improved tuning. - "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more folio conversions and removal of now-unused page-based APIs. - "replace per-quota region priorities histogram buffer with per-context one" from SeongJae Park. DAMON histogram rationalization. - "Docs/damon: update GitHub repo URLs and maintainer-profile" from SeongJae Park. DAMON documentation updates. - "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and improve related doc and warn" from Jason Wang: fixes usage of page allocator __GFP_NOFAIL and GFP_ATOMIC flags. - "mm: split underused THPs" from Yu Zhao. Improve THP=always policy. This was overprovisioning THPs in sparsely accessed memory areas. - "zram: introduce custom comp backends API" frm Sergey Senozhatsky. Add support for zram run-time compression algorithm tuning. - "mm: Care about shadow stack guard gap when getting an unmapped area" from Mark Brown. Fix up the various arch_get_unmapped_area() implementations to better respect guard areas. - "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability of mem_cgroup_iter() and various code cleanups. - "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge pfnmap support. - "resource: Fix region_intersects() vs add_memory_driver_managed()" from Huang Ying. Fix a bug in region_intersects() for systems with CXL memory. - "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches a couple more code paths to correctly recover from the encountering of poisoned memry. - "mm: enable large folios swap-in support" from Barry Song. Support the swapin of mTHP memory into appropriately-sized folios, rather than into single-page folios" * tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (416 commits) zram: free secondary algorithms names uprobes: turn xol_area->pages[2] into xol_area->page uprobes: introduce the global struct vm_special_mapping xol_mapping Revert "uprobes: use vm_special_mapping close() functionality" mm: support large folios swap-in for sync io devices mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios mm: fix swap_read_folio_zeromap() for large folios with partial zeromap mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries set_memory: add __must_check to generic stubs mm/vma: return the exact errno in vms_gather_munmap_vmas() memcg: cleanup with !CONFIG_MEMCG_V1 mm/show_mem.c: report alloc tags in human readable units mm: support poison recovery from copy_present_page() mm: support poison recovery from do_cow_fault() resource, kunit: add test case for region_intersects() resource: make alloc_free_mem_region() works for iomem_resource mm: z3fold: deprecate CONFIG_Z3FOLD vfio/pci: implement huge_fault support mm/arm64: support large pfn mappings mm/x86: support large pfn mappings ...
2024-09-19Merge branch 'pci/npem'Bjorn Helgaas
- Initialize leds class earlier (with an unfortunate Makefile ordering change) so the PCI NPEM driver can use it (Mariusz Tkaczyk) - Add Native PCIe Enclosure Management (NPEM) support for sysfs control of NVMe RAID storage indicators (ok/fail/locate/rebuild/etc) (Mariusz Tkaczyk) - Add support for the ACPI _DSM PCIe SSD status LED management, which is functionally similar to NPEM but mediated by platform firmware (Mariusz Tkaczyk) * pci/npem: PCI/NPEM: Add _DSM PCIe SSD status LED management PCI/NPEM: Add Native PCIe Enclosure Management support leds: Init leds class earlier
2024-09-19Merge tag 'sched-core-2024-09-19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Implement the SCHED_DEADLINE server infrastructure - Daniel Bristot de Oliveira's last major contribution to the kernel: "SCHED_DEADLINE servers can help fixing starvation issues of low priority tasks (e.g., SCHED_OTHER) when higher priority tasks monopolize CPU cycles. Today we have RT Throttling; DEADLINE servers should be able to replace and improve that." (Daniel Bristot de Oliveira, Peter Zijlstra, Joel Fernandes, Youssef Esmat, Huang Shijie) - Preparatory changes for sched_ext integration: - Use set_next_task(.first) where required - Fix up set_next_task() implementations - Clean up DL server vs. core sched - Split up put_prev_task_balance() - Rework pick_next_task() - Combine the last put_prev_task() and the first set_next_task() - Rework dl_server - Add put_prev_task(.next) (Peter Zijlstra, with a fix by Tejun Heo) - Complete the EEVDF transition and refine EEVDF scheduling: - Implement delayed dequeue - Allow shorter slices to wakeup-preempt - Use sched_attr::sched_runtime to set request/slice suggestion - Document the new feature flags - Remove unused and duplicate-functionality fields - Simplify & unify pick_next_task_fair() - Misc debuggability enhancements (Peter Zijlstra, with fixes/cleanups by Dietmar Eggemann, Valentin Schneider and Chuyi Zhou) - Initialize the vruntime of a new task when it is first enqueued, resulting in significant decrease in latency of newly woken tasks (Zhang Qiao) - Introduce SM_IDLE and an idle re-entry fast-path in __schedule() (K Prateek Nayak, Peter Zijlstra) - Clean up and clarify the usage of Clean up usage of rt_task() (Qais Yousef) - Preempt SCHED_IDLE entities in strict cgroup hierarchies (Tianchen Ding) - Clarify the documentation of time units for deadline scheduler parameters (Christian Loehle) - Remove the HZ_BW chicken-bit feature flag introduced a year ago, the original change seems to be working fine (Phil Auld) - Misc fixes and cleanups (Chen Yu, Dan Carpenter, Huang Shijie, Peilin He, Qais Yousefm and Vincent Guittot) * tag 'sched-core-2024-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits) sched/cpufreq: Use NSEC_PER_MSEC for deadline task cpufreq/cppc: Use NSEC_PER_MSEC for deadline task sched/deadline: Clarify nanoseconds in uapi sched/deadline: Convert schedtool example to chrt sched/debug: Fix the runnable tasks output sched: Fix sched_delayed vs sched_core kernel/sched: Fix util_est accounting for DELAY_DEQUEUE kthread: Fix task state in kthread worker if being frozen sched/pelt: Use rq_clock_task() for hw_pressure sched/fair: Move effective_cpu_util() and effective_cpu_util() in fair.c sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule() sched: Add put_prev_task(.next) sched: Rework dl_server sched: Combine the last put_prev_task() and the first set_next_task() sched: Rework pick_next_task() sched: Split up put_prev_task_balance() sched: Clean up DL server vs core sched sched: Fixup set_next_task() implementations sched: Use set_next_task(.first) where required sched/fair: Properly deactivate sched_delayed task upon class change ...
2024-09-19Merge tag 'drm-next-2024-09-19' of https://gitlab.freedesktop.org/drm/kernelLinus Torvalds
Pull drm updates from Dave Airlie: "This adds a couple of patches outside the drm core, all should be acked appropriately, the string and pstore ones are the main ones that come to mind. Otherwise it's the usual drivers, xe is getting enabled by default on some new hardware, we've changed the device number handling to allow more devices, and we added some optional rust code to create QR codes in the panic handler, an idea first suggested I think 10 years ago :-) string: - add mem_is_zero() core: - support more device numbers - use XArray for minor ids - add backlight constants - Split dma fence array creation into alloc and arm fbdev: - remove usage of old fbdev hooks kms: - Add might_fault() to drm_modeset_lock priming - Add dynamic per-crtc vblank configuration support dma-buf: - docs cleanup buddy: - Add start address support for trim function printk: - pass description to kmsg_dump scheduler: - Remove full_recover from drm_sched_start ttm: - Make LRU walk restartable after dropping locks - Allow direct reclaim to allocate local memory panic: - add display QR code (in rust) displayport: - mst: GUID improvements bridge: - Silence error message on -EPROBE_DEFER - analogix: Clean aup - bridge-connector: Fix double free - lt6505: Disable interrupt when powered off - tc358767: Make default DP port preemphasis configurable - lt9611uxc: require DRM_BRIDGE_ATTACH_NO_CONNECTOR - anx7625: simplify OF array handling - dw-hdmi: simplify clock handling - lontium-lt8912b: fix mode validation - nwl-dsi: fix mode vsync/hsync polarity xe: - Enable LunarLake and Battlemage support - Introducing Xe2 ccs modifiers for integrated and discrete graphics - rename xe perf to xe observation - use wb caching on DGFX for system memory - add fence timeouts - Lunar Lake graphics/media/display workarounds - Battlemage workarounds - Battlemage GSC support - GSC and HuC fw updates for LL/BM - use dma_fence_chain_free - refactor hw engine lookup and mmio access - enable priority mem read for Xe2 - Add first GuC BMG fw - fix dma-resv lock - Fix DGFX display suspend/resume - Use xe_managed for kernel BOs - Use reserved copy engine for user binds on faulting devices - Allow mixing dma-fence jobs and long-running faulting jobs - fix media TLB invalidation - fix rpm in TTM swapout path - track resources and VF state by PF i915: - Type-C programming fix for MTL+ - FBC cleanup - Calc vblank delay more accurately - On DP MST, Enable LT fallback for UHBR<->non-UHBR rates - Fix DP LTTPR detection - limit relocations to INT_MAX - fix long hangs in buddy allocator on DG2/A380 amdgpu: - Per-queue reset support - SDMA devcoredump support - DCN 4.0.1 updates - GFX12/VCN4/JPEG4 updates - Convert vbios embedded EDID to drm_edid - GFX9.3/9.4 devcoredump support - process isolation framework for GFX 9.4.3/4 - take IOMMU mappings into account for P2P DMA amdkfd: - CRIU fixes - HMM fix - Enable process isolation support for GFX 9.4.3/4 - Allow users to target recommended SDMA engines - KFD support for targetting queues on recommended SDMA engines radeon: - remove .load and drm_dev_alloc - Fix vbios embedded EDID size handling - Convert vbios embedded EDID to drm_edid - Use GEM references instead of TTM - r100 cp init cleanup - Fix potential overflows in evergreen CS offset tracking msm: - DPU: - implement DP/PHY mapping on SC8180X - Enable writeback on SM8150, SC8180X, SM6125, SM6350 - DP: - Enable widebus on all relevant chipsets - MSM8998 HDMI support - GPU: - A642L speedbin support - A615/A306/A621 support - A7xx devcoredump support ast: - astdp: Support AST2600 with VGA - Clean up HPD - Fix timeout loop for DP link training - reorganize output code by type (VGA, DP, etc) - convert to struct drm_edid - fix BMC handling for all outputs exynos: - drop stale MAINTAINERS pattern - constify struct loongson: - use GEM refcount over TTM mgag200: - Improve BMC handling - Support VBLANK intterupts - transparently support BMC outputs nouveau: - Refactor and clean up internals - Use GEM refcount over TTM's gm12u320: - convert to struct drm_edid gma500: - update i2c terms lcdif: - pixel clock fix host1x: - fix syncpoint IRQ during resume - use iommu_paging_domain_alloc() imx: - ipuv3: convert to struct drm_edid omapdrm: - improve error handling - use common helper for_each_endpoint_of_node() panel: - add support for BOE TV101WUM-LL2 plus DT bindings - novatek-nt35950: improve error handling - nv3051d: improve error handling - panel-edp: - add support for BOE NE140WUM-N6G - revert support for SDC ATNA45AF01 - visionox-vtdr6130: - improve error handling - use devm_regulator_bulk_get_const() - boe-th101mb31ig002: - Support for starry-er88577 MIPI-DSI panel plus DT - Fix porch parameter - edp: Support AOU B116XTN02.3, AUO B116XAN06.1, AOU B116XAT04.1, BOE NV140WUM-N41, BOE NV133WUM-N63, BOE NV116WHM-A4D, CMN N116BCA-EA2, CMN N116BCP-EA2, CSW MNB601LS1-4 - himax-hx8394: Support Microchip AC40T08A MIPI Display panel plus DT - ilitek-ili9806e: Support Densitron DMT028VGHMCMI-1D TFT plus DT - jd9365da: - Support Melfas lmfbx101117480 MIPI-DSI panel plus DT - Refactor for code sharing - panel-edp: fix name for HKC MB116AN01 - jd9365da: fix "exit sleep" commands - jdi-fhd-r63452: simplify error handling with DSI multi-style helpers - mantix-mlaf057we51: simplify error handling with DSI multi-style helpers - simple: - support Innolux G070ACE-LH3 plus DT bindings - support On Tat Industrial Company KD50G21-40NT-A1 plus DT bindings - st7701: - decouple DSI and DRM code - add SPI support - support Anbernic RG28XX plus DT bindings mediatek: - support alpha blending - remove cl in struct cmdq_pkt - ovl adaptor fix - add power domain binding for mediatek DPI controller renesas: - rz-du: add support for RZ/G2UL plus DT bindings rockchip: - Improve DP sink-capability reporting - dw_hdmi: Support 4k@60Hz - vop: - Support RGB display on Rockchip RK3066 - Support 4096px width sti: - convert to struct drm_edid stm: - Avoid UAF wih managed plane and CRTC helpers - Fix module owner - Fix error handling in probe - Depend on COMMON_CLK - ltdc: - Fix transparency after disabling plane - Remove unused interrupt tegra: - gr3d: improve PM domain handling - convert to struct drm_edid - Call drm_atomic_helper_shutdown() vc4: - fix PM during detect - replace DRM_ERROR() with drm_error() - v3d: simplify clock retrieval v3d: - Clean up perfmon virtio: - add DRM capset" * tag 'drm-next-2024-09-19' of https://gitlab.freedesktop.org/drm/kernel: (1326 commits) drm/xe: Fix missing conversion to xe_display_pm_runtime_resume drm/xe/xe2hpg: Add Wa_15016589081 drm/xe: Don't keep stale pointer to bo->ggtt_node drm/xe: fix missing 'xe_vm_put' drm/xe: fix build warning with CONFIG_PM=n drm/xe: Suppress missing outer rpm protection warning drm/xe: prevent potential UAF in pf_provision_vf_ggtt() drm/amd/display: Add all planes on CRTC to state for overlay cursor drm/i915/bios: fix printk format width drm/i915/display: Fix BMG CCS modifiers drm/amdgpu: get rid of bogus includes of fdtable.h drm/amdkfd: CRIU fixes drm/amdgpu: fix a race in kfd_mem_export_dmabuf() drm: new helper: drm_gem_prime_handle_to_dmabuf() drm/amdgpu/atomfirmware: Silence UBSAN warning drm/amdgpu: Fix kdoc entry in 'amdgpu_vm_cpu_prepare' drm/amd/amdgpu: apply command submission parser for JPEG v1 drm/amd/amdgpu: apply command submission parser for JPEG v2+ drm/amd/pm: fix the pp_dpm_pcie issue on smu v14.0.2/3 drm/amd/pm: update the features set on smu v14.0.2/3 ...
2024-09-19Merge tag 'hid-for-linus-2024091602' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid Pull HID updates from Jiri Kosina: - New HID over SPI driver for Goodix devices that don't follow Microsoft's HID-over-SPI specification, so a separate driver is needed. Currently supported device is GT7986U touchscreen (Charles Wang) - support for new hardware features in Wacom driver (high-res wheel scrolling, touchstrings with relative motions, support for two touchrings) (Jason Gerecke) - support for customized vendor firmware loading in intel-ish driver (Zhang Lixu) - fix for theoretical race condition in i2c-hid (Dmitry Torokhov) - support for HIDIOCREVOKE -- evdev's EVIOCREVOKE equivalent in hidraw (Peter Hutterer) - initial hidraw selftest implementation (Benjamin Tissoires) - constification of device-specific report descriptors (Thomas Weißschuh) - other small assorted fixes and device ID / quirk additions * tag 'hid-for-linus-2024091602' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid: (54 commits) hid: cp2112: Use irq_get_trigger_type() helper HID: i2c-hid: ensure various commands do not interfere with each other HID: multitouch: Add support for Thinkpad X12 Gen 2 Kbd Portfolio HID: wacom: Do not warn about dropped packets for first packet HID: wacom: Support sequence numbers smaller than 16-bit HID: lg: constify fixed up report descriptor HID: uclogic: constify fixed up report descriptor HID: waltop: constify fixed up report descriptor HID: sony: constify fixed up report descriptor HID: pxrc: constify fixed up report descriptor HID: steelseries: constify fixed up report descriptor HID: viewsonic: constify fixed up report descriptor HID: vrc2: constify fixed up report descriptor HID: xiaomi: constify fixed up report descriptor HID: maltron: constify fixed up report descriptor HID: keytouch: constify fixed up report descriptor HID: holtek-kbd: constify fixed up report descriptor HID: dr: constify fixed up report descriptor HID: bigbenff: constify fixed up report descriptor HID: picoLCD: Use backlight power constants ...
2024-09-18Merge tag 'hardening-v6.12-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening updates from Kees Cook: - lib/string_choices: - Add str_up_down() helper (Michal Wajdeczko) - Add str_true_false()/str_false_true() helper (Hongbo Li) - Introduce several opposite string choice helpers (Hongbo Li) - lib/string_helpers: - rework overflow-dependent code (Justin Stitt) - fortify: refactor test_fortify Makefile to fix some build problems (Masahiro Yamada) - string: Check for "nonstring" attribute on strscpy() arguments - virt: vbox: Replace 1-element arrays with flexible arrays - media: venus: hfi_cmds: Replace 1-element arrays with flexible arrays * tag 'hardening-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: lib/string_choices: Add some comments to make more clear for string choices helpers. lib/string_choices: Introduce several opposite string choice helpers lib/string_choices: Add str_true_false()/str_false_true() helper string: Check for "nonstring" attribute on strscpy() arguments media: venus: hfi_cmds: struct hfi_session_release_buffer_pkt: Add __counted_by annotation media: venus: hfi_cmds: struct hfi_session_release_buffer_pkt: Replace 1-element array with flexible array virt: vbox: struct vmmdev_hgcm_pagelist: Replace 1-element array with flexible array lib/string_helpers: rework overflow-dependent code coccinelle: Add rules to find str_down_up() replacements string_choices: Add wrapper for str_down_up() coccinelle: Add rules to find str_up_down() replacements lib/string_choices: Add str_up_down() helper fortify: use if_changed_dep to record header dependency in *.cmd files fortify: move test_fortify.sh to lib/test_fortify/ fortify: refactor test_fortify Makefile to fix some build problems
2024-09-18exfat: Implement sops->shutdown and ioctlDongliang Cui
We found that when writing a large file through buffer write, if the disk is inaccessible, exFAT does not return an error normally, which leads to the writing process not stopping properly. To easily reproduce this issue, you can follow the steps below: 1. format a device to exFAT and then mount (with a full disk erase) 2. dd if=/dev/zero of=/exfat_mount/test.img bs=1M count=8192 3. eject the device You may find that the dd process does not stop immediately and may continue for a long time. The root cause of this issue is that during buffer write process, exFAT does not need to access the disk to look up directory entries or the FAT table (whereas FAT would do) every time data is written. Instead, exFAT simply marks the buffer as dirty and returns, delegating the writeback operation to the writeback process. If the disk cannot be accessed at this time, the error will only be returned to the writeback process, and the original process will not receive the error, so it cannot be returned to the user side. When the disk cannot be accessed normally, an error should be returned to stop the writing process. Implement sops->shutdown and ioctl to shut down the file system when underlying block device is marked dead. Signed-off-by: Dongliang Cui <dongliang.cui@unisoc.com> Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2024-09-17Merge tag 'sound-6.12-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound updates from Takashi Iwai: "A fairly big update at this time, both in core and driver sides. The core received rewrites in PCM buffer allocation handling and locking optimizations, PCM rate updates followed by lots of cleanups. In ASoC side, the legacy Intel drivers have been deprecated by AVS drivers which leaded to the significant amount of code reduction. SoundWire driver updates and other cleanups contributed more code reduction, too. USB-audio driver received a large cleanup of its big quirk table, and the old snd_print*() API usages in many legacy drivers are replaced with the standard print API. Here are some highlights: Core: - More optimized locking in ALSA control code - Rewrites of memalloc helpers for better DMA API usage - Drop of obsoleted vmalloc PCM buffer helper API - Continued MIDI2 UMP updates - Support of a new user-space driven timer instance - Update for more PCM support rates and cleanups - Xrun counter report in the proc files ASoC: - Continued simplification and cleanup works for ASoC - Extensive cleanups and refactoring of the Soundwire drivers - Removal of Intel machine support obsoleted by the AVS driver - Lots of DT schema conversions - Machine support for many AMD and Intel x86 platforms - Support for AMD ACP 7.1, Mediatek MT6367 and MT8365, Realtek RTL1320 SoundWire and rev C, and Texas Instruments TAS2563 USB-audio: - Add support of multiple control interfaces - A large rewrite of quirk table with macros - Support for RME Digiface USB HD-audio: - Cleanup of quirk code for Samsung Galaxy laptops - Clean up of detection of Cirrus codecs - C-Media CM9825 HD-audio codec support Others: - Rewrites to standard print API in a lot of legacy drivers" * tag 'sound-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (410 commits) ASoC: topology: Fix redundant logical jump ASoC: tas2781: Add Calibration Kcontrols for Chromebook ASoC: amd: acp: refactor SoundWire machine driver code ASoC: sdw_utils/intel: move soundwire endpoint parsing helper functions ASoC: sdw_util/intel: move soundwire endpoint and dai link structures ASoC: intel: sof_sdw: rename soundwire parsing helper functions ASoC: intel: sof_sdw: rename soundwire endpoint and dailink structures ASoC: atmel: mchp-pdmc: Retain Non-Runtime Controls ALSA: hda/realtek: Add support for Galaxy Book2 Pro (NP950XEE) ASoC: mediatek: mt7986-afe-pcm: Remove redundant error message ALSA: memalloc: Use proper DMA mapping API for x86 S/G buffer allocations ALSA: memalloc: Use proper DMA mapping API for x86 WC buffer allocations ALSA: usb-audio: Add logitech Audio profile quirk ASoc: mediatek: mt8365: Remove unneeded assignment ASoC: Intel: ARL: Add entry for HDMI-In capture support to non-I2S codec boards. ASoC: Intel: sof_rt5682: Add HDMI-In capture with rt5682 support for ARL. ASoC: SOF: Intel: hda: remove common_hdmi_codec_drv ASoC: Intel: sof_pcm512x: do not check common_hdmi_codec_drv ASoC: Intel: ehl_rt5660: do not check common_hdmi_codec_drv ASoC: Intel: skl_hda_dsp_generic: use common module for DAI links ...
2024-09-17Merge tag 'x86-fpu-2024-09-17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fpu updates from Thomas Gleixner: "Provide FPU buffer layout in core dumps: Debuggers have guess the FPU buffer layout in core dumps, which is error prone. This is because AMD and Intel layouts differ. To avoid buggy heuristics add a ELF section which describes the buffer layout which can be retrieved by tools" * tag 'x86-fpu-2024-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/elf: Add a new FPU buffer layout info to x86 core files
2024-09-17Merge tag 'spi-v6.12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi Pull spi updates from Mark Brown: "This is quite a quiet release for SPI. The one new core feature here is support for configuring the state of the MOSI pin when the bus is idle, there are some devices which are very fragile in this regard even when the chip select signal is not asserted. Otherwise we have some new driver support, a bunch of small fixes and some general cleanup work. - Support for configuring the state of the MOSI pin when the the bus is idle - Add the Elgin JG0309-01 in spidev - Support for Marvell xSPI, Mediatek MTK7981, Microchip PIC64GX, NXP i.MX8ULP, and Rockchip RK3576 controllers I also accidentally pulled in an IIO DT bindings update due to a typo when applying the MOSI idle state patches" * tag 'spi-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: (65 commits) spi: geni-qcom: Use devm functions to simplify code spi: remove spi_controller_is_slave() and spi_slave_abort() platform/olpc: olpc-xo175-ec: switch to use spi_target_abort(). spi: slave-mt27xx: switch to use target_abort spi: spidev: switch to use spi_target_abort() spi: slave-system-control: switch to use spi_target_abort() spi: slave-time: switch to use spi_target_abort() spi: switch to use spi_controller_is_target() spi: fspi: add support for imx8ulp spi: fspi: involve lut_num for struct nxp_fspi_devtype_data dt-bindings: spi: nxp-fspi: add imx8ulp support spi: spidev_fdx: Fix the wrong format specifier spi: mxs: Switch to RUNTIME/SYSTEM_SLEEP_PM_OPS() spi: dt-bindings: Add rockchip,rk3576-spi compatible spi: Revert "spi: Insert the missing pci_dev_put()before return" spi: zynq-qspi: Replace kzalloc with kmalloc for buffer allocation spi: ppc4xx: Sort headers spi: ppc4xx: Revert "handle irq_of_parse_and_map() errors" spi: zynqmp-gqspi: Simplify with dev_err_probe() spi: zynqmp-gqspi: Use devm_spi_alloc_host() ...
2024-09-16landlock: Add signal scopingTahera Fahimi
Currently, a sandbox process is not restricted to sending a signal (e.g. SIGKILL) to a process outside the sandbox environment. The ability to send a signal for a sandboxed process should be scoped the same way abstract UNIX sockets are scoped. Therefore, we extend the "scoped" field in a ruleset with LANDLOCK_SCOPE_SIGNAL to specify that a ruleset will deny sending any signal from within a sandbox process to its parent (i.e. any parent sandbox or non-sandboxed processes). This patch adds file_set_fowner and file_free_security hooks to set and release a pointer to the file owner's domain. This pointer, fown_domain in landlock_file_security will be used in file_send_sigiotask to check if the process can send a signal. The ruleset_with_unknown_scope test is updated to support LANDLOCK_SCOPE_SIGNAL. This depends on two new changes: - commit 1934b212615d ("file: reclaim 24 bytes from f_owner"): replace container_of(fown, struct file, f_owner) with fown->file . - commit 26f204380a3c ("fs: Fix file_set_fowner LSM hook inconsistencies"): lock before calling the hook. Signed-off-by: Tahera Fahimi <fahimitahera@gmail.com> Closes: https://github.com/landlock-lsm/linux/issues/8 Link: https://lore.kernel.org/r/df2b4f880a2ed3042992689a793ea0951f6798a5.1725657727.git.fahimitahera@gmail.com [mic: Update landlock_get_current_domain()'s return type, improve and fix locking in hook_file_set_fowner(), simplify and fix sleepable call and locking issue in hook_file_send_sigiotask() and rebase on the latest VFS tree, simplify hook_task_kill() and quickly return when not sandboxed, improve comments, rename LANDLOCK_SCOPED_SIGNAL] Co-developed-by: Mickaël Salaün <mic@digikod.net> Signed-off-by: Mickaël Salaün <mic@digikod.net>
2024-09-16landlock: Add abstract UNIX socket scopingTahera Fahimi
Introduce a new "scoped" member to landlock_ruleset_attr that can specify LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET to restrict connection to abstract UNIX sockets from a process outside of the socket's domain. Two hooks are implemented to enforce these restrictions: unix_stream_connect and unix_may_send. Closes: https://github.com/landlock-lsm/linux/issues/7 Signed-off-by: Tahera Fahimi <fahimitahera@gmail.com> Link: https://lore.kernel.org/r/5f7ad85243b78427242275b93481cfc7c127764b.1725494372.git.fahimitahera@gmail.com [mic: Fix commit message formatting, improve documentation, simplify hook_unix_may_send(), and cosmetic fixes including rename of LANDLOCK_SCOPED_ABSTRACT_UNIX_SOCKET] Co-developed-by: Mickaël Salaün <mic@digikod.net> Signed-off-by: Mickaël Salaün <mic@digikod.net>
2024-09-16Merge tag 'lsm-pr-20240911' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm Pull lsm updates from Paul Moore: - Move the LSM framework to static calls This transitions the vast majority of the LSM callbacks into static calls. Those callbacks which haven't been converted were left as-is due to the general ugliness of the changes required to support the static call conversion; we can revisit those callbacks at a future date. - Add the Integrity Policy Enforcement (IPE) LSM This adds a new LSM, Integrity Policy Enforcement (IPE). There is plenty of documentation about IPE in this patches, so I'll refrain from going into too much detail here, but the basic motivation behind IPE is to provide a mechanism such that administrators can restrict execution to only those binaries which come from integrity protected storage, e.g. a dm-verity protected filesystem. You will notice that IPE requires additional LSM hooks in the initramfs, dm-verity, and fs-verity code, with the associated patches carrying ACK/review tags from the associated maintainers. We couldn't find an obvious maintainer for the initramfs code, but the IPE patchset has been widely posted over several years. Both Deven Bowers and Fan Wu have contributed to IPE's development over the past several years, with Fan Wu agreeing to serve as the IPE maintainer moving forward. Once IPE is accepted into your tree, I'll start working with Fan to ensure he has the necessary accounts, keys, etc. so that he can start submitting IPE pull requests to you directly during the next merge window. - Move the lifecycle management of the LSM blobs to the LSM framework Management of the LSM blobs (the LSM state buffers attached to various kernel structs, typically via a void pointer named "security" or similar) has been mixed, some blobs were allocated/managed by individual LSMs, others were managed by the LSM framework itself. Starting with this pull we move management of all the LSM blobs, minus the XFRM blob, into the framework itself, improving consistency across LSMs, and reducing the amount of duplicated code across LSMs. Due to some additional work required to migrate the XFRM blob, it has been left as a todo item for a later date; from a practical standpoint this omission should have little impact as only SELinux provides a XFRM LSM implementation. - Fix problems with the LSM's handling of F_SETOWN The LSM hook for the fcntl(F_SETOWN) operation had a couple of problems: it was racy with itself, and it was disconnected from the associated DAC related logic in such a way that the LSM state could be updated in cases where the DAC state would not. We fix both of these problems by moving the security_file_set_fowner() hook into the same section of code where the DAC attributes are updated. Not only does this resolve the DAC/LSM synchronization issue, but as that code block is protected by a lock, it also resolve the race condition. - Fix potential problems with the security_inode_free() LSM hook Due to use of RCU to protect inodes and the placement of the LSM hook associated with freeing the inode, there is a bit of a challenge when it comes to managing any LSM state associated with an inode. The VFS folks are not open to relocating the LSM hook so we have to get creative when it comes to releasing an inode's LSM state. Traditionally we have used a single LSM callback within the hook that is triggered when the inode is "marked for death", but not actually released due to RCU. Unfortunately, this causes problems for LSMs which want to take an action when the inode's associated LSM state is actually released; so we add an additional LSM callback, inode_free_security_rcu(), that is called when the inode's LSM state is released in the RCU free callback. - Refactor two LSM hooks to better fit the LSM return value patterns The vast majority of the LSM hooks follow the "return 0 on success, negative values on failure" pattern, however, there are a small handful that have unique return value behaviors which has caused confusion in the past and makes it difficult for the BPF verifier to properly vet BPF LSM programs. This includes patches to convert two of these"special" LSM hooks to the common 0/-ERRNO pattern. - Various cleanups and improvements A handful of patches to remove redundant code, better leverage the IS_ERR_OR_NULL() helper, add missing "static" markings, and do some minor style fixups. * tag 'lsm-pr-20240911' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: (40 commits) security: Update file_set_fowner documentation fs: Fix file_set_fowner LSM hook inconsistencies lsm: Use IS_ERR_OR_NULL() helper function lsm: remove LSM_COUNT and LSM_CONFIG_COUNT ipe: Remove duplicated include in ipe.c lsm: replace indirect LSM hook calls with static calls lsm: count the LSMs enabled at compile time kernel: Add helper macros for loop unrolling init/main.c: Initialize early LSMs after arch code, static keys and calls. MAINTAINERS: add IPE entry with Fan Wu as maintainer documentation: add IPE documentation ipe: kunit test for parser scripts: add boot policy generation program ipe: enable support for fs-verity as a trust provider fsverity: expose verified fsverity built-in signatures to LSMs lsm: add security_inode_setintegrity() hook ipe: add support for dm-verity as a trust provider dm-verity: expose root hash digest and signature data to LSMs block,lsm: add LSM blob and new LSM hooks for block devices ipe: add permissive toggle ...
2024-09-17Merge tag 'amd-drm-next-6.12-2024-09-13' of ↵Dave Airlie
https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.12-2024-09-13: amdgpu: - GPUVM sync fixes - kdoc fixes - Misc spelling mistakes - Add some raven GFXOFF quirks - Use clamp helper - DC fixes - JPEG fixes - Process isolation fix - Queue reset fix - W=1 cleanup - SMU14 fixes - JPEG fixes amdkfd: - Fetch cacheline info from IP discovery - Queue reset fix - RAS fix - Document SVM events - CRIU fixes - Race fix in dma-buf handling drm: - dma-buf fd race fixes Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240913134139.2861073-1-alexander.deucher@amd.com
2024-09-16Merge tag 'for-6.12/io_uring-discard-20240913' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring async discard support from Jens Axboe: "Sitting on top of both the 6.12 block and io_uring core branches, here's support for async discard through io_uring. This allows applications to issue async discards, rather than rely on the blocking sync ioctl discards we already have. The sync support is difficult to use outside of idle/cleanup periods. On a real (but slow) device, testing shows the following results when compared to sync discard: qd64 sync discard: 21K IOPS, lat avg 3 msec (max 21 msec) qd64 async discard: 76K IOPS, lat avg 845 usec (max 2.2 msec) qd64 sync discard: 14K IOPS, lat avg 5 msec (max 25 msec) qd64 async discard: 56K IOPS, lat avg 1153 usec (max 3.6 msec) and synthetic null_blk testing with the same queue depth and block size settings as above shows: Type Trim size IOPS Lat avg (usec) Lat Max (usec) ============================================================== sync 4k 144K 444 20314 async 4k 1353K 47 595 sync 1M 56K 1136 21031 async 1M 94K 680 760" * tag 'for-6.12/io_uring-discard-20240913' of git://git.kernel.dk/linux: block: implement async io_uring discard cmd block: introduce blk_validate_byte_range() filemap: introduce filemap_invalidate_pages io_uring/cmd: give inline space in request to cmds io_uring/cmd: expose iowq to cmds
2024-09-16Merge tag 'for-6.12/block-20240913' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - MD changes via Song: - md-bitmap refactoring (Yu Kuai) - raid5 performance optimization (Artur Paszkiewicz) - Other small fixes (Yu Kuai, Chen Ni) - Add a sysfs entry 'new_level' (Xiao Ni) - Improve information reported in /proc/mdstat (Mateusz Kusiak) - NVMe changes via Keith: - Asynchronous namespace scanning (Stuart) - TCP TLS updates (Hannes) - RDMA queue controller validation (Niklas) - Align field names to the spec (Anuj) - Metadata support validation (Puranjay) - A syntax cleanup (Shen) - Fix a Kconfig linking error (Arnd) - New queue-depth quirk (Keith) - Add missing unplug trace event (Keith) - blk-iocost fixes (Colin, Konstantin) - t10-pi modular removal and fixes (Alexey) - Fix for potential BLKSECDISCARD overflow (Alexey) - bio splitting cleanups and fixes (Christoph) - Deal with folios rather than rather than pages, speeding up how the block layer handles bigger IOs (Kundan) - Use spinlocks rather than bit spinlocks in zram (Sebastian, Mike) - Reduce zoned device overhead in ublk (Ming) - Add and use sendpages_ok() for drbd and nvme-tcp (Ofir) - Fix regression in partition error pointer checking (Riyan) - Add support for write zeroes and rotational status in nbd (Wouter) - Add Yu Kuai as new BFQ maintainer. The scheduler has been unmaintained for quite a while. - Various sets of fixes for BFQ (Yu Kuai) - Misc fixes and cleanups (Alvaro, Christophe, Li, Md Haris, Mikhail, Yang) * tag 'for-6.12/block-20240913' of git://git.kernel.dk/linux: (120 commits) nvme-pci: qdepth 1 quirk block: fix potential invalid pointer dereference in blk_add_partition blk_iocost: make read-only static array vrate_adj_pct const block: unpin user pages belonging to a folio at once mm: release number of pages of a folio block: introduce folio awareness and add a bigger size from folio block: Added folio-ized version of bio_add_hw_page() block, bfq: factor out a helper to split bfqq in bfq_init_rq() block, bfq: remove local variable 'bfqq_already_existing' in bfq_init_rq() block, bfq: remove local variable 'split' in bfq_init_rq() block, bfq: remove bfq_log_bfqg() block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator() block, bfq: fix procress reference leakage for bfqq in merge chain block, bfq: fix uaf for accessing waker_bfqq after splitting blk-throttle: support prioritized processing of metadata blk-throttle: remove last_low_overflow_time drbd: Add NULL check for net_conf to prevent dereference in state validation nvme-tcp: fix link failure for TCP auth blk-mq: add missing unplug trace event mtip32xx: Remove redundant null pointer checks in mtip_hw_debugfs_init() ...
2024-09-16Merge tag 'for-6.12/io_uring-20240913' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring updates from Jens Axboe: - NAPI fixes and cleanups (Pavel, Olivier) - Add support for absolute timeouts (Pavel) - Fixes for io-wq/sqpoll affinities (Felix) - Efficiency improvements for dealing with huge pages (Chenliang) - Support for a minwait mode, where the application essentially has two timouts - one smaller one that defines the batch timeout, and the overall large one similar to what we had before. This enables efficient use of batching based on count + timeout, while still working well with periods of less intensive workloads - Use ITER_UBUF for single segment sends - Add support for incremental buffer consumption. Right now each operation will always consume a full buffer. With incremental consumption, a recv/read operation only consumes the part of the buffer that it needs to satisfy the operation - Add support for GCOV for io_uring, to help retain a high coverage of test to code ratio - Fix regression with ocfs2, where an odd -EOPNOTSUPP wasn't correctly converted to a blocking retry - Add support for cloning registered buffers from one ring to another - Misc cleanups (Anuj, me) * tag 'for-6.12/io_uring-20240913' of git://git.kernel.dk/linux: (35 commits) io_uring: add IORING_REGISTER_COPY_BUFFERS method io_uring/register: provide helper to get io_ring_ctx from 'fd' io_uring/rsrc: add reference count to struct io_mapped_ubuf io_uring/rsrc: clear 'slot' entry upfront io_uring/io-wq: inherit cpuset of cgroup in io worker io_uring/io-wq: do not allow pinning outside of cpuset io_uring/rw: drop -EOPNOTSUPP check in __io_complete_rw_common() io_uring/rw: treat -EOPNOTSUPP for IOCB_NOWAIT like -EAGAIN io_uring/sqpoll: do not allow pinning outside of cpuset io_uring/eventfd: move refs to refcount_t io_uring: remove unused rsrc_put_fn io_uring: add new line after variable declaration io_uring: add GCOV_PROFILE_URING Kconfig option io_uring/kbuf: add support for incremental buffer consumption io_uring/kbuf: pass in 'len' argument for buffer commit Revert "io_uring: Require zeroed sqe->len on provided-buffers send" io_uring/kbuf: move io_ring_head_to_buf() to kbuf.h io_uring/kbuf: add io_kbuf_commit() helper io_uring/kbuf: shrink nr_iovs/mode in struct buf_sel_arg io_uring: wire up min batch wake timeout ...
2024-09-16Merge tag 'vfs-6.12.mount' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount updates from Christian Brauner: "Recently, we added the ability to list mounts in other mount namespaces and the ability to retrieve namespace file descriptors without having to go through procfs by deriving them from pidfds. This extends nsfs in two ways: (1) Add the ability to retrieve information about a mount namespace via NS_MNT_GET_INFO. This will return the mount namespace id and the number of mounts currently in the mount namespace. The number of mounts can be used to size the buffer that needs to be used for listmount() and is in general useful without having to actually iterate through all the mounts. The structure is extensible. (2) Add the ability to iterate through all mount namespaces over which the caller holds privilege returning the file descriptor for the next or previous mount namespace. To retrieve a mount namespace the caller must be privileged wrt to it's owning user namespace. This means that PID 1 on the host can list all mounts in all mount namespaces or that a container can list all mounts of its nested containers. Optionally pass a structure for NS_MNT_GET_INFO with NS_MNT_GET_{PREV,NEXT} to retrieve information about the mount namespace in one go. (1) and (2) can be implemented for other namespace types easily. Together with recent api additions this means one can iterate through all mounts in all mount namespaces without ever touching procfs. The commit message in 49224a345c48 ('Merge patch series "nsfs: iterate through mount namespaces"') contains example code how to do this" * tag 'vfs-6.12.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: nsfs: iterate through mount namespaces file: add fput() cleanup helper fs: add put_mnt_ns() cleanup helper fs: allow mount namespace fd
2024-09-16Merge tag 'vfs-6.12.fallocate' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fallocate updates from Christian Brauner: "This contains work to try and cleanup some the fallocate mode handling. Currently, it confusingly mixes operation modes and an optional flag. The work here tries to better define operation modes and optional flags allowing the core and filesystem code to use switch statements to switch on the operation mode" * tag 'vfs-6.12.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: xfs: refactor xfs_file_fallocate xfs: move the xfs_is_always_cow_inode check into xfs_alloc_file_space xfs: call xfs_flush_unmap_range from xfs_free_file_space fs: sort out the fallocate mode vs flag mess ext4: remove tracing for FALLOC_FL_NO_HIDE_STALE block: remove checks for FALLOC_FL_NO_HIDE_STALE
2024-09-16Merge tag 'vfs-6.12.misc' of ↵Linus Torvalds
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual pile of misc updates: Features: - Add F_CREATED_QUERY fcntl() that allows userspace to query whether a file was actually created. Often userspace wants to know whether an O_CREATE request did actually create a file without using O_EXCL. The current logic is that to first attempts to open the file without O_CREAT | O_EXCL and if ENOENT is returned userspace tries again with both flags. If that succeeds all is well. If it now reports EEXIST it retries. That works fairly well but some corner cases make this more involved. If this operates on a dangling symlink the first openat() without O_CREAT | O_EXCL will return ENOENT but the second openat() with O_CREAT | O_EXCL will fail with EEXIST. The reason is that openat() without O_CREAT | O_EXCL follows the symlink while O_CREAT | O_EXCL doesn't for security reasons. So it's not something we can really change unless we add an explicit opt-in via O_FOLLOW which seems really ugly. All available workarounds are really nasty (fanotify, bpf lsm etc) so add a simple fcntl(). - Try an opportunistic lookup for O_CREAT. Today, when opening a file we'll typically do a fast lookup, but if O_CREAT is set, the kernel always takes the exclusive inode lock. This was likely done with the expectation that O_CREAT means that we always expect to do the create, but that's often not the case. Many programs set O_CREAT even in scenarios where the file already exists (see related F_CREATED_QUERY patch motivation above). The series contained in the pr rearranges the pathwalk-for-open code to also attempt a fast_lookup in certain O_CREAT cases. If a positive dentry is found, the inode_lock can be avoided altogether and it can stay in rcuwalk mode for the last step_into. - Expose the 64 bit mount id via name_to_handle_at() Now that we provide a unique 64-bit mount ID interface in statx(2), we can now provide a race-free way for name_to_handle_at(2) to provide a file handle and corresponding mount without needing to worry about racing with /proc/mountinfo parsing or having to open a file just to do statx(2). While this is not necessary if you are using AT_EMPTY_PATH and don't care about an extra statx(2) call, users that pass full paths into name_to_handle_at(2) need to know which mount the file handle comes from (to make sure they don't try to open_by_handle_at a file handle from a different filesystem) and switching to AT_EMPTY_PATH would require allocating a file for every name_to_handle_at(2) call - Add a per dentry expire timeout to autofs There are two fairly well known automounter map formats, the autofs format and the amd format (more or less System V and Berkley). Some time ago Linux autofs added an amd map format parser that implemented a fair amount of the amd functionality. This was done within the autofs infrastructure and some functionality wasn't implemented because it either didn't make sense or required extra kernel changes. The idea was to restrict changes to be within the existing autofs functionality as much as possible and leave changes with a wider scope to be considered later. One of these changes is implementing the amd options: 1) "unmount", expire this mount according to a timeout (same as the current autofs default). 2) "nounmount", don't expire this mount (same as setting the autofs timeout to 0 except only for this specific mount) . 3) "utimeout=<seconds>", expire this mount using the specified timeout (again same as setting the autofs timeout but only for this mount) To implement these options per-dentry expire timeouts need to be implemented for autofs indirect mounts. This is because all map keys (mounts) for autofs indirect mounts use an expire timeout stored in the autofs mount super block info. structure and all indirect mounts use the same expire timeout. Fixes: - Fix missing fput for FSCONFIG_SET_FD in autofs - Use param->file for FSCONFIG_SET_FD in coda - Delete the 'fs/netfs' proc subtreee when netfs module exits - Make sure that struct uid_gid_map fits into a single cacheline - Don't flush in-flight wb switches for superblocks without cgroup writeback - Correcting the idmapping mount example in the idmapping documentation - Fix a race between evice_inodes() and find_inode() and iput() - Refine the show_inode_state() macro definition in writeback code - Prevent dump_mapping() from accessing invalid dentry.d_name.name - Show actual source for debugfs in /proc/mounts - Annotate data-race of busy_poll_usecs in eventpoll - Don't WARN for racy path_noexec check in exec code - Handle OOM on mnt_warn_timestamp_expiry() - Fix some spelling in the iomap design documentation - Fix typo in procfs comment - Fix typo in fs/namespace.c comment Cleanups: - Add the VFS git tree to the MAINTAINERS file - Move FMODE_UNSIGNED_OFFSET to fop_flags freeing up another f_mode bit in struct file bringing us to 5 free f_mode bits - Remove the __I_DIO_WAKEUP bit from i_state flags as we can simplify the wait mechanism - Remove the unused path_put_init() helper - Replace a __u32 with u32 for s_fsnotify_mask as __u32 is uapi specific - Replace the unsigned long i_state member with a u32 i_state member in struct inode freeing up 4 bytes in struct inode. Instead of using the bit based wait apis we're now using the var event apis and using the individual bytes of the i_state member to wait on state changes - Explain how per-syscall AT_* flags should be allocated - Use in_group_or_capable() helper to simplify the posix acl mode update code - Switch to LIST_HEAD() in fsync_buffers_list() to simplify the code - Removed comment about d_rcu_to_refcount() as that function doesn't exist anymore - Add kernel documentation for lookup_fast() - Don't re-zero evenpoll fields - Remove outdated comment after close_fd() - Fix imprecise wording in comment about the pipe filesystem - Drop GFP_NOFAIL mode from alloc_page_buffers - Missing blank line warnings and struct declaration improved in file_table - Annotate struct poll_list with __counted_by() - Remove the unused read parameter in percpu-rwsem - Remove linux/prefetch.h include from direct-io code - Use kmemdup_array instead of kmemdup for multiple allocation in mnt_idmapping code - Remove unused mnt_cursor_del() declaration Performance tweaks: - Dodge smp_mb in break_lease and break_deleg in the common case - Only read fops once in fops_{get,put}() - Use RCU in ilookup() - Elide smp_mb in iversion handling in the common case - Drop one lock trip in evict()" * tag 'vfs-6.12.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (58 commits) uidgid: make sure we fit into one cacheline proc: Fix typo in the comment fs/pipe: Correct imprecise wording in comment fhandle: expose u64 mount id to name_to_handle_at(2) uapi: explain how per-syscall AT_* flags should be allocated fs: drop GFP_NOFAIL mode from alloc_page_buffers writeback: Refine the show_inode_state() macro definition fs/inode: Prevent dump_mapping() accessing invalid dentry.d_name.name mnt_idmapping: Use kmemdup_array instead of kmemdup for multiple allocation netfs: Delete subtree of 'fs/netfs' when netfs module exits fs: use LIST_HEAD() to simplify code inode: make i_state a u32 inode: port __I_LRU_ISOLATING to var event vfs: fix race between evice_inodes() and find_inode()&iput() inode: port __I_NEW to var event inode: port __I_SYNC to var event fs: reorder i_state bits fs: add i_state helpers MAINTAINERS: add the VFS git tree fs: s/__u32/u32/ for s_fsnotify_mask ...
2024-09-16Merge tag 'arm64-upstream' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Will Deacon: "The highlights are support for Arm's "Permission Overlay Extension" using memory protection keys, support for running as a protected guest on Android as well as perf support for a bunch of new interconnect PMUs. Summary: ACPI: - Enable PMCG erratum workaround for HiSilicon HIP10 and 11 platforms. - Ensure arm64-specific IORT header is covered by MAINTAINERS. CPU Errata: - Enable workaround for hardware access/dirty issue on Ampere-1A cores. Memory management: - Define PHYSMEM_END to fix a crash in the amdgpu driver. - Avoid tripping over invalid kernel mappings on the kexec() path. - Userspace support for the Permission Overlay Extension (POE) using protection keys. Perf and PMUs: - Add support for the "fixed instruction counter" extension in the CPU PMU architecture. - Extend and fix the event encodings for Apple's M1 CPU PMU. - Allow LSM hooks to decide on SPE permissions for physical profiling. - Add support for the CMN S3 and NI-700 PMUs. Confidential Computing: - Add support for booting an arm64 kernel as a protected guest under Android's "Protected KVM" (pKVM) hypervisor. Selftests: - Fix vector length issues in the SVE/SME sigreturn tests - Fix build warning in the ptrace tests. Timers: - Add support for PR_{G,S}ET_TSC so that 'rr' can deal with non-determinism arising from the architected counter. Miscellaneous: - Rework our IPI-based CPU stopping code to try NMIs if regular IPIs don't succeed. - Minor fixes and cleanups" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (94 commits) perf: arm-ni: Fix an NULL vs IS_ERR() bug arm64: hibernate: Fix warning for cast from restricted gfp_t arm64: esr: Define ESR_ELx_EC_* constants as UL arm64: pkeys: remove redundant WARN perf: arm_pmuv3: Use BR_RETIRED for HW branch event if enabled MAINTAINERS: List Arm interconnect PMUs as supported perf: Add driver for Arm NI-700 interconnect PMU dt-bindings/perf: Add Arm NI-700 PMU perf/arm-cmn: Improve format attr printing perf/arm-cmn: Clean up unnecessary NUMA_NO_NODE check arm64/mm: use lm_alias() with addresses passed to memblock_free() mm: arm64: document why pte is not advanced in contpte_ptep_set_access_flags() arm64: Expose the end of the linear map in PHYSMEM_END arm64: trans_pgd: mark PTEs entries as valid to avoid dead kexec() arm64/mm: Delete __init region from memblock.reserved perf/arm-cmn: Support CMN S3 dt-bindings: perf: arm-cmn: Add CMN S3 perf/arm-cmn: Refactor DTC PMU register access perf/arm-cmn: Make cycle counts less surprising perf/arm-cmn: Improve build-time assertion ...
2024-09-14io_uring: rename "copy buffers" to "clone buffers"Jens Axboe
A recent commit added support for copying registered buffers from one ring to another. But that term is a bit confusing, as no copying of buffer data is done here. What is being done is simply cloning the buffer registrations from one ring to another. Rename it while we still can, so that it's more descriptive. No functional changes in this patch. Fixes: 7cc2a6eadcd7 ("io_uring: add IORING_REGISTER_COPY_BUFFERS method") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-13net: fib_rules: Add DSCP selector attributeIdo Schimmel
The FIB rule TOS selector is implemented differently between IPv4 and IPv6. In IPv4 it is used to match on the three "Type of Services" bits specified in RFC 791, while in IPv6 is it is used to match on the six DSCP bits specified in RFC 2474. Add a new FIB rule attribute to allow matching on DSCP. The attribute will be used to implement a 'dscp' selector in ip-rule with a consistent behavior between IPv4 and IPv6. For now, set the type of the attribute to 'NLA_REJECT' so that user space will not be able to configure it. This restriction will be lifted once both IPv4 and IPv6 support the new attribute. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240911093748.3662015-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-13RDMA/nldev: Expose whether RDMA monitoring is supportedChiara Meiohas
Extend the "rdma sys" command to display whether RDMA monitoring is supported. RDMA monitoring is not supported in mlx4 because it does not use the ib_device_set_netdev() API, which sends the RDMA events. Example output for kernel where monitoring is supported: $ rdma sys show netns shared privileged-qkey off monitor on copy-on-fork on Example output for kernel where monitoring is not supported: $ rdma sys show netns shared privileged-qkey off monitor off copy-on-fork on Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/20240909173025.30422-8-michaelgur@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-13RDMA/nldev: Add support for RDMA monitoringChiara Meiohas
Introduce a new netlink command to allow rdma event monitoring. The rdma events supported now are IB device registration/unregistration and net device attachment/detachment. Example output of rdma monitor and the commands which trigger the events: $ rdma monitor $ rmmod mlx5_ib [UNREGISTER] dev 1 rocep8s0f1 [UNREGISTER] dev 0 rocep8s0f0 $ modprobe mlx5_ib [REGISTER] dev 2 mlx5_0 [NETDEV_ATTACH] dev 2 mlx5_0 port 1 netdev 4 eth2 [REGISTER] dev 3 mlx5_1 [NETDEV_ATTACH] dev 3 mlx5_1 port 1 netdev 5 eth3 $ devlink dev eswitch set pci/0000:08:00.0 mode switchdev [UNREGISTER] dev 2 rocep8s0f0 [REGISTER] dev 4 mlx5_0 [NETDEV_ATTACH] dev 4 mlx5_0 port 30 netdev 4 eth2 $ echo 4 > /sys/class/net/eth2/device/sriov_numvfs [NETDEV_ATTACH] dev 4 rdmap8s0f0 port 2 netdev 7 eth4 [NETDEV_ATTACH] dev 4 rdmap8s0f0 port 3 netdev 8 eth5 [NETDEV_ATTACH] dev 4 rdmap8s0f0 port 4 netdev 9 eth6 [NETDEV_ATTACH] dev 4 rdmap8s0f0 port 5 netdev 10 eth7 [REGISTER] dev 5 mlx5_0 [NETDEV_ATTACH] dev 5 mlx5_0 port 1 netdev 11 eth8 [REGISTER] dev 6 mlx5_0 [NETDEV_ATTACH] dev 6 mlx5_0 port 1 netdev 12 eth9 [REGISTER] dev 7 mlx5_0 [NETDEV_ATTACH] dev 7 mlx5_0 port 1 netdev 13 eth10 [REGISTER] dev 8 mlx5_0 [NETDEV_ATTACH] dev 8 mlx5_0 port 1 netdev 14 eth11 $ echo 0 > /sys/class/net/eth2/device/sriov_numvfs [UNREGISTER] dev 5 rocep8s0f0v0 [UNREGISTER] dev 6 rocep8s0f0v1 [UNREGISTER] dev 7 rocep8s0f0v2 [UNREGISTER] dev 8 rocep8s0f0v3 [NETDEV_DETACH] dev 4 rdmap8s0f0 port 2 [NETDEV_DETACH] dev 4 rdmap8s0f0 port 3 [NETDEV_DETACH] dev 4 rdmap8s0f0 port 4 [NETDEV_DETACH] dev 4 rdmap8s0f0 port 5 Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/20240909173025.30422-7-michaelgur@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-12uapi: libc-compat: remove ipx leftoversJakub Kicinski
The uAPI headers for IPX were deleted 3 years ago in commit 6c9b40844751 ("net: Remove net/ipx.h and uapi/linux/ipx.h header files") Delete the leftover defines from libc-compat.h Link: https://patch.msgid.link/20240911002142.1508694-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts (sort of) and no adjacent changes. This merge reverts commit b3c9e65eb227 ("net: hsr: remove seqnr_lock") from net, as it was superseded by commit 430d67bdcb04 ("net: hsr: Use the seqnr lock for frames received via interlink port.") in net-next. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-12io_uring: add IORING_REGISTER_COPY_BUFFERS methodJens Axboe
Buffers can get registered with io_uring, which allows to skip the repeated pin_pages, unpin/unref pages for each O_DIRECT operation. This reduces the overhead of O_DIRECT IO. However, registrering buffers can take some time. Normally this isn't an issue as it's done at initialization time (and hence less critical), but for cases where rings can be created and destroyed as part of an IO thread pool, registering the same buffers for multiple rings become a more time sensitive proposition. As an example, let's say an application has an IO memory pool of 500G. Initial registration takes: Got 500 huge pages (each 1024MB) Registered 500 pages in 409 msec or about 0.4 seconds. If we go higher to 900 1GB huge pages being registered: Registered 900 pages in 738 msec which is, as expected, a fully linear scaling. Rather than have each ring pin/map/register the same buffer pool, provide an io_uring_register(2) opcode to simply duplicate the buffers that are registered with another ring. Adding the same 900GB of registered buffers to the target ring can then be accomplished in: Copied 900 pages in 17 usec While timing differs a bit, this provides around a 25,000-40,000x speedup for this use case. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-12spi: Merge up fixesMark Brown
A patch for Qualcomm depends on some fixes.
2024-09-11net: ethernet: oa_tc6: implement internal PHY initializationParthiban Veerasooran
Internal PHY is initialized as per the PHY register capability supported by the MAC-PHY. Direct PHY Register Access Capability indicates if PHY registers are directly accessible within the SPI register memory space. Indirect PHY Register Access Capability indicates if PHY registers are indirectly accessible through the MDIO/MDC registers MDIOACCn defined in OPEN Alliance specification. Currently the direct register access is only supported. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Parthiban Veerasooran <Parthiban.Veerasooran@microchip.com> Link: https://patch.msgid.link/20240909082514.262942-7-Parthiban.Veerasooran@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-11netdev: add dmabuf introspectionMina Almasry
Add dmabuf information to page_pool stats: $ ./cli.py --spec ../netlink/specs/netdev.yaml --dump page-pool-get ... {'dmabuf': 10, 'id': 456, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 455, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 454, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 453, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 452, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 451, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 450, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 449, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, And queue stats: $ ./cli.py --spec ../netlink/specs/netdev.yaml --dump queue-get ... {'dmabuf': 10, 'id': 8, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 9, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 10, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 11, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 12, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 13, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 14, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 15, 'ifindex': 3, 'type': 'rx'}, Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20240910171458.219195-14-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-11net: add SO_DEVMEM_DONTNEED setsockopt to release RX fragsMina Almasry
Add an interface for the user to notify the kernel that it is done reading the devmem dmabuf frags returned as cmsg. The kernel will drop the reference on the frags to make them available for reuse. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com> Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20240910171458.219195-11-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-11tcp: RX path for devmem TCPMina Almasry
In tcp_recvmsg_locked(), detect if the skb being received by the user is a devmem skb. In this case - if the user provided the MSG_SOCK_DEVMEM flag - pass it to tcp_recvmsg_devmem() for custom handling. tcp_recvmsg_devmem() copies any data in the skb header to the linear buffer, and returns a cmsg to the user indicating the number of bytes returned in the linear buffer. tcp_recvmsg_devmem() then loops over the unaccessible devmem skb frags, and returns to the user a cmsg_devmem indicating the location of the data in the dmabuf device memory. cmsg_devmem contains this information: 1. the offset into the dmabuf where the payload starts. 'frag_offset'. 2. the size of the frag. 'frag_size'. 3. an opaque token 'frag_token' to return to the kernel when the buffer is to be released. The pages awaiting freeing are stored in the newly added sk->sk_user_frags, and each page passed to userspace is get_page()'d. This reference is dropped once the userspace indicates that it is done reading this page. All pages are released when the socket is destroyed. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com> Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20240910171458.219195-10-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-11net: netdev netlink api to bind dma-buf to a net deviceMina Almasry
API takes the dma-buf fd as input, and binds it to the netdevice. The user can specify the rx queues to bind the dma-buf to. Suggested-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20240910171458.219195-3-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-11Merge branch 'tip/sched/core' into sched_ext/for-6.12Tejun Heo
Pull in tip/sched/core to resolve two merge conflicts: - 96fd6c65efc6 ("sched: Factor out update_other_load_avgs() from __update_blocked_others()") 5d871a63997f ("sched/fair: Move effective_cpu_util() and effective_cpu_util() in fair.c") A simple context conflict. The former added __update_blocked_others() in the same #ifdef CONFIG_SMP block that effective_cpu_util() and sched_cpu_util() are in and the latter moved those functions to fair.c. This makes __update_blocked_others() more out of place. Will follow up with a patch to relocate. - 96fd6c65efc6 ("sched: Factor out update_other_load_avgs() from __update_blocked_others()") 84d265281d6c ("sched/pelt: Use rq_clock_task() for hw_pressure") The former factored out the body of __update_blocked_others() into update_other_load_avgs(). The latter changed how update_hw_load_avg() is called in the body. Resolved by applying the change to update_other_load_avgs() instead. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-11block: implement async io_uring discard cmdPavel Begunkov
io_uring allows implementing custom file specific asynchronous operations via the fops->uring_cmd callback, a.k.a. IORING_OP_URING_CMD requests or just io_uring commands. Use it to add support for async discards. Normally, it first tries to queue up bios in a non-blocking context, and if that fails, we'd retry from a blocking context by returning -EAGAIN to the core io_uring. We always get the result from bios asynchronously by setting a custom bi_end_io callback, at which point we drag the request into the task context to either reissue or complete it and post a completion to the user. Unlike ioctl(BLKDISCARD) with stronger guarantees against races, we only do a best effort attempt to invalidate page cache, and it can race with any writes and reads and leave page cache stale. It's the same kind of races we allow to direct writes. Also, apart from cases where discarding is not allowed at all, e.g. discards are not supported or the file/device is read only, the user should assume that the sector range on disk is not valid anymore, even when an error was returned to the user. Suggested-by: Conrad Meyer <conradmeyer@meta.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2b5210443e4fa0257934f73dfafcc18a77cd0e09.1726072086.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-11Merge branch 'for-6.12/io_uring' into for-6.12/io_uring-discardJens Axboe
* for-6.12/io_uring: (31 commits) io_uring/io-wq: inherit cpuset of cgroup in io worker io_uring/io-wq: do not allow pinning outside of cpuset io_uring/rw: drop -EOPNOTSUPP check in __io_complete_rw_common() io_uring/rw: treat -EOPNOTSUPP for IOCB_NOWAIT like -EAGAIN io_uring/sqpoll: do not allow pinning outside of cpuset io_uring/eventfd: move refs to refcount_t io_uring: remove unused rsrc_put_fn io_uring: add new line after variable declaration io_uring: add GCOV_PROFILE_URING Kconfig option io_uring/kbuf: add support for incremental buffer consumption io_uring/kbuf: pass in 'len' argument for buffer commit Revert "io_uring: Require zeroed sqe->len on provided-buffers send" io_uring/kbuf: move io_ring_head_to_buf() to kbuf.h io_uring/kbuf: add io_kbuf_commit() helper io_uring/kbuf: shrink nr_iovs/mode in struct buf_sel_arg io_uring: wire up min batch wake timeout io_uring: add support for batch wait timeout io_uring: implement our own schedule timeout handling io_uring: move schedule wait logic into helper io_uring: encapsulate extraneous wait flags into a separate struct ...
2024-09-11Merge branch 'for-6.12/block' into for-6.12/io_uring-discardJens Axboe
* for-6.12/block: (115 commits) block: unpin user pages belonging to a folio at once mm: release number of pages of a folio block: introduce folio awareness and add a bigger size from folio block: Added folio-ized version of bio_add_hw_page() block, bfq: factor out a helper to split bfqq in bfq_init_rq() block, bfq: remove local variable 'bfqq_already_existing' in bfq_init_rq() block, bfq: remove local variable 'split' in bfq_init_rq() block, bfq: remove bfq_log_bfqg() block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator() block, bfq: fix procress reference leakage for bfqq in merge chain block, bfq: fix uaf for accessing waker_bfqq after splitting blk-throttle: support prioritized processing of metadata blk-throttle: remove last_low_overflow_time drbd: Add NULL check for net_conf to prevent dereference in state validation blk-mq: add missing unplug trace event mtip32xx: Remove redundant null pointer checks in mtip_hw_debugfs_init() md: Add new_level sysfs interface zram: Shrink zram_table_entry::flags. zram: Remove ZRAM_LOCK zram: Replace bit spinlocks with a spinlock_t. ...
2024-09-11sched/deadline: Clarify nanoseconds in uapiChristian Loehle
Specify the time values of the deadline parameters of deadline, runtime, and period as being in nanoseconds explicitly as they always have been. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Link: https://lore.kernel.org/r/20240813144348.1180344-3-christian.loehle@arm.com
2024-09-11Merge v6.11-rc7 into drm-nextSimona Vetter
Thomas needs 5a498d4d06d6 ("drm/fbdev-dma: Only install deferred I/O if necessary") in drm-misc, so start the backmerge cascade. Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
2024-09-10PCI: Rename CRS Completion Status to RRSBjorn Helgaas
PCIe r6.0 changed the abbreviation for "Configuration Request Retry Status" Completion Status from "CRS" to "RRS" and uses the terminology of "Configuration RRS Software Visibility" instead of "CRS Software Visibility". Align the Linux usage with the r6.0 spec language. No functional change intended. It's confusing to make this change, but I think "RRS" *is* a better abbreviation because it was easy to interpret "CRS" as "Completion Retry Status", which really didn't make any sense. Link: https://lore.kernel.org/r/20240827234848.4429-4-helgaas@kernel.org Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2024-09-10net-timestamp: introduce SOF_TIMESTAMPING_OPT_RX_FILTER flagJason Xing
introduce a new flag SOF_TIMESTAMPING_OPT_RX_FILTER in the receive path. User can set it with SOF_TIMESTAMPING_SOFTWARE to filter out rx software timestamp report, especially after a process turns on netstamp_needed_key which can time stamp every incoming skb. Previously, we found out if an application starts first which turns on netstamp_needed_key, then another one only passing SOF_TIMESTAMPING_SOFTWARE could also get rx timestamp. Now we handle this case by introducing this new flag without breaking users. Quoting Willem to explain why we need the flag: "why a process would want to request software timestamp reporting, but not receive software timestamp generation. The only use I see is when the application does request SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_TX_SOFTWARE." Similarly, this new flag could also be used for hardware case where we can set it with SOF_TIMESTAMPING_RAW_HARDWARE, then we won't receive hardware receive timestamp. Another thing about errqueue in this patch I have a few words to say: In this case, we need to handle the egress path carefully, or else reporting the tx timestamp will fail. Egress path and ingress path will finally call sock_recv_timestamp(). We have to distinguish them. Errqueue is a good indicator to reflect the flow direction. Suggested-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20240909015612.3856-2-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-10Merge branch 'for-linus' into for-nextTakashi Iwai
Signed-off-by: Takashi Iwai <tiwai@suse.de>
2024-09-10vdpa: support set mac address from vdpa toolCindy Lu
Add new UAPI to support the mac address from vdpa tool Function vdpa_nl_cmd_dev_attr_set_doit() will get the new MAC address from the vdpa tool and then set it to the device. The usage is: vdpa dev set name vdpa_name mac **:**:**:**:**:** Here is example: root@L1# vdpa -jp dev config show vdpa0 { "config": { "vdpa0": { "mac": "82:4d:e9:5d:d7:e6", "link ": "up", "link_announce ": false, "mtu": 1500 } } } root@L1# vdpa dev set name vdpa0 mac 00:11:22:33:44:55 root@L1# vdpa -jp dev config show vdpa0 { "config": { "vdpa0": { "mac": "00:11:22:33:44:55", "link ": "up", "link_announce ": false, "mtu": 1500 } } } Signed-off-by: Cindy Lu <lulu@redhat.com> Message-Id: <20240731031653.1047692-2-lulu@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>
2024-09-10virtio_balloon: introduce memory scan/reclaim infozhenwei pi
Expose memory scan/reclaim information to the host side via virtio balloon device. Now we have a metric to analyze the memory performance: y: counter increases n: counter does not changes h: the rate of counter change is high l: the rate of counter change is low OOM: VIRTIO_BALLOON_S_OOM_KILL STALL: VIRTIO_BALLOON_S_ALLOC_STALL ASCAN: VIRTIO_BALLOON_S_SCAN_ASYNC DSCAN: VIRTIO_BALLOON_S_SCAN_DIRECT ARCLM: VIRTIO_BALLOON_S_RECLAIM_ASYNC DRCLM: VIRTIO_BALLOON_S_RECLAIM_DIRECT - OOM[y], STALL[*], ASCAN[*], DSCAN[*], ARCLM[*], DRCLM[*]: the guest runs under really critial memory pressure - OOM[n], STALL[h], ASCAN[*], DSCAN[l], ARCLM[*], DRCLM[l]: the memory allocation stalls due to cgroup, not the global memory pressure. - OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[h]: the memory allocation stalls due to global memory pressure. The performance gets hurt a lot. A high ratio between DRCLM/DSCAN shows quite effective memory reclaiming. - OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[l]: the memory allocation stalls due to global memory pressure. the ratio between DRCLM/DSCAN gets low, the guest OS is thrashing heavily, the serious case leads poor performance and difficult trouble shooting. Ex, sshd may block on memory allocation when accepting new connections, a user can't login a VM by ssh command. - OOM[n], STALL[n], ASCAN[h], DSCAN[n], ARCLM[l], DRCLM[n]: the low ratio between ARCLM/ASCAN shows that the guest tries to reclaim more memory, but it can't. Once more memory is required in future, it will struggle to reclaim memory. Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com> Message-Id: <20240423034109.1552866-5-pizhenwei@bytedance.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-10virtio_balloon: introduce memory allocation stall counterzhenwei pi
Memory allocation stall counter represents the performance/latency of memory allocation, expose this counter to the host side by virtio balloon device via out-of-bound way. Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com> Message-Id: <20240423034109.1552866-4-pizhenwei@bytedance.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-10virtio_balloon: introduce oom-kill invocationszhenwei pi
When the guest OS runs under critical memory pressure, the guest starts to kill processes. A guest monitor agent may scan 'oom_kill' from /proc/vmstat, and reports the OOM KILL event. However, the agent may be killed and we will loss this critical event(and the later events). For now we can also grep for magic words in guest kernel log from host side. Rather than this unstable way, virtio balloon reports OOM-KILL invocations instead. Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com> Message-Id: <20240423034109.1552866-3-pizhenwei@bytedance.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-09Merge 6.11-rc7 into usb-nextGreg Kroah-Hartman
We need the USB fixes in here as well, and this also resolves the merge conflict in: drivers/usb/typec/ucsi/ucsi.c Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-09Merge 6.11-rc7 into char-misc-nextGreg Kroah-Hartman
We need the char-misc fixes in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-08ptp/ioctl: support MONOTONIC{,_RAW} timestamps for PTP_SYS_OFFSET_EXTENDEDMahesh Bandewar
The ability to read the PHC (Physical Hardware Clock) alongside multiple system clocks is currently dependent on the specific hardware architecture. This limitation restricts the use of PTP_SYS_OFFSET_PRECISE to certain hardware configurations. The generic soultion which would work across all architectures is to read the PHC along with the latency to perform PHC-read as offered by PTP_SYS_OFFSET_EXTENDED which provides pre and post timestamps. However, these timestamps are currently limited to the CLOCK_REALTIME timebase. Since CLOCK_REALTIME is affected by NTP (or similar time synchronization services), it can experience significant jumps forward or backward. This hinders the precise latency measurements that PTP_SYS_OFFSET_EXTENDED is designed to provide. This problem could be addressed by supporting MONOTONIC_RAW timestamps within PTP_SYS_OFFSET_EXTENDED. Unlike CLOCK_REALTIME or CLOCK_MONOTONIC, the MONOTONIC_RAW timebase is unaffected by NTP adjustments. This enhancement can be implemented by utilizing one of the three reserved words within the PTP_SYS_OFFSET_EXTENDED struct to pass the clock-id for timestamps. The current behavior aligns with clock-id for CLOCK_REALTIME timebase (value of 0), ensuring backward compatibility of the UAPI. Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: Vadim Fedorenko <vadfed@meta.com> Signed-off-by: David S. Miller <davem@davemloft.net>