Age | Commit message (Collapse) | Author |
|
Patch series "kexec: Fix kexec_file_load for llvm16 with PGO", v7.
When upreving llvm I realised that kexec stopped working on my test
platform.
The reason seems to be that due to PGO there are multiple .text sections
on the purgatory, and kexec does not supports that.
This patch (of 4):
Clang16 links the purgatory text in two sections when PGO is in use:
[ 1] .text PROGBITS 0000000000000000 00000040
00000000000011a1 0000000000000000 AX 0 0 16
[ 2] .rela.text RELA 0000000000000000 00003498
0000000000000648 0000000000000018 I 24 1 8
...
[17] .text.hot. PROGBITS 0000000000000000 00003220
000000000000020b 0000000000000000 AX 0 0 1
[18] .rela.text.hot. RELA 0000000000000000 00004428
0000000000000078 0000000000000018 I 24 17 8
And both of them have their range [sh_addr ... sh_addr+sh_size] on the
area pointed by `e_entry`.
This causes that image->start is calculated twice, once for .text and
another time for .text.hot. The second calculation leaves image->start
in a random location.
Because of this, the system crashes immediately after:
kexec_core: Starting new kernel
Link: https://lkml.kernel.org/r/20230321-kexec_clang16-v7-0-b05c520b7296@chromium.org
Link: https://lkml.kernel.org/r/20230321-kexec_clang16-v7-1-b05c520b7296@chromium.org
Fixes: 930457057abe ("kernel/kexec_file.c: split up __kexec_load_puragory")
Signed-off-by: Ricardo Ribalda <ribalda@chromium.org>
Reviewed-by: Ross Zwisler <zwisler@google.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Philipp Rudo <prudo@redhat.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmer@rivosinc.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Rix <trix@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
We used to not pass in the pgoff correctly when register/unregister uffd
regions, it caused incorrect behavior on vma merging and can cause
mergeable vmas being separate after ioctls return.
For example, when we have:
vma1(range 0-9, with uffd), vma2(range 10-19, no uffd)
Then someone unregisters uffd on range (5-9), it should logically become:
vma1(range 0-4, with uffd), vma2(range 5-19, no uffd)
But with current code we'll have:
vma1(range 0-4, with uffd), vma3(range 5-9, no uffd), vma2(range 10-19, no uffd)
This patch allows such merge to happen correctly before ioctl returns.
This behavior seems to have existed since the 1st day of uffd. Since
pgoff for vma_merge() is only used to identify the possibility of vma
merging, meanwhile here what we did was always passing in a pgoff smaller
than what we should, so there should have no other side effect besides not
merging it. Let's still tentatively copy stable for this, even though I
don't see anything will go wrong besides vma being split (which is mostly
not user visible).
Link: https://lkml.kernel.org/r/20230517190916.3429499-3-peterx@redhat.com
Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/uffd: Fix vma merge/split", v2.
This series contains two patches that fix vma merge/split for userfaultfd
on two separate issues.
Patch 1 fixes a regression since 6.1+ due to something we overlooked when
converting to maple tree apis. The plan is we use patch 1 to replace the
commit "2f628010799e (mm: userfaultfd: avoid passing an invalid range to
vma_merge())" in mm-hostfixes-unstable tree if possible, so as to bring
uffd vma operations back aligned with the rest code again.
Patch 2 fixes a long standing issue that vma can be left unmerged even if
we can for either uffd register or unregister.
Many thanks to Lorenzo on either noticing this issue from the assert
movement patch, looking at this problem, and also provided a reproducer on
the unmerged vma issue [1].
[1] https://gist.github.com/lorenzo-stoakes/a11a10f5f479e7a977fc456331266e0e
This patch (of 2):
It seems vma merging with uffd paths is broken with either
register/unregister, where right now we can feed wrong parameters to
vma_merge() and it's found by recent patch which moved asserts upwards in
vma_merge() by Lorenzo Stoakes:
https://lore.kernel.org/all/ZFunF7DmMdK05MoF@FVFF77S0Q05N.cambridge.arm.com/
It's possible that "start" is contained within vma but not clamped to its
start. We need to convert this into either "cannot merge" case or "can
merge" case 4 which permits subdivision of prev by assigning vma to prev.
As we loop, each subsequent VMA will be clamped to the start.
This patch will eliminate the report and make sure vma_merge() calls will
become legal again.
One thing to mention is that the "Fixes: 29417d292bd0" below is there only
to help explain where the warning can start to trigger, the real commit to
fix should be 69dbe6daf104. Commit 29417d292bd0 helps us to identify the
issue, but unfortunately we may want to keep it in Fixes too just to ease
kernel backporters for easier tracking.
Link: https://lkml.kernel.org/r/20230517190916.3429499-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20230517190916.3429499-2-peterx@redhat.com
Fixes: 69dbe6daf104 ("userfaultfd: use maple tree iterator to iterate VMAs")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Closes: https://lore.kernel.org/all/ZFunF7DmMdK05MoF@FVFF77S0Q05N.cambridge.arm.com/
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The xarray.c file contains the only call to radix_tree_node_rcu_free(),
and it comes with its own extern declaration for it. This means the
function definition causes a missing-prototype warning:
lib/radix-tree.c:288:6: error: no previous prototype for 'radix_tree_node_rcu_free' [-Werror=missing-prototypes]
Instead, move the declaration for this function to a new header that can
be included by both, and do the same for the radix_tree_node_cachep
variable that has the same underlying problem but does not cause a warning
with gcc.
[zhangpeng.00@bytedance.com: fix building radix tree test suite]
Link: https://lkml.kernel.org/r/20230521095450.21332-1-zhangpeng.00@bytedance.com
Link: https://lkml.kernel.org/r/20230516194212.548910-1-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
A syzbot fault injection test reported that nilfs_btnode_create_block, a
helper function that allocates a new node block for b-trees, causes a
kernel BUG for disk images where the file system block size is smaller
than the page size.
This was due to unexpected flags on the newly allocated buffer head, and
it turned out to be because the buffer flags were not cleared by
nilfs_btnode_abort_change_key() after an error occurred during a b-tree
update operation and the buffer was later reused in that state.
Fix this issue by using nilfs_btnode_delete() to abandon the unused
preallocated buffer in nilfs_btnode_abort_change_key().
Link: https://lkml.kernel.org/r/20230513102428.10223-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+b0a35a5c1f7e846d3b09@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/000000000000d1d6c205ebc4d512@google.com
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
"User events:
- Use long instead of int for storing the enable set/clear bit, as it
was found that big endian machines could end up using the wrong
bits.
- Split allocating mm and attaching it. This keeps the allocation
separate from the registration and avoids various races.
- Remove RCU locking around pin_user_pages_remote() as that can
schedule. The RCU protection is no longer needed with the above
split of mm allocation and attaching.
- Rename the "link" fields of the various structs to something more
meaningful.
- Add comments around user_event_mm struct usage and locking
requirements.
Timerlat tracer:
- Fix missed wakeup of timerlat thread caused by the timerlat
interrupt triggering when tracing is off. The timer interrupt
handler needs to always wake up the timerlat thread regardless if
tracing is enabled or not, otherwise, it will never wake up.
Histograms:
- Fix regression of breaking the "stacktrace" modifier for variables.
That modifier cannot be used for values, but can be used for
variables that are passed from one histogram to the next. This was
broken when adding the restriction to values as the variable logic
used the same code.
- Rename the special field "stacktrace" to "common_stacktrace".
Special fields (that are not actually part of the event, but can
act just like event fields, like 'comm' and 'timestamp') should be
prefixed with 'common_' for consistency. To keep backward
compatibility, 'stacktrace' can still be used (as with the special
field 'cpu'), but can be overridden if the event has a field called
'stacktrace'.
- Update the synthetic event selftests to use the new name (synthetic
events are created by histograms)
Tracing bootup selftests:
- Reorganize the code to keep artifacts of the selftests not compiled
in when selftests are not configured.
- Add various cond_resched() around the selftest code, as the
softlock watchdog was triggering much more often. It appears that
the kernel runs slower now with full debugging enabled.
- While debugging ftrace with ftrace (using an instance ring buffer
instead of the top level one), I found that the selftests were
disabling prints to the debug instance.
This should not happen, as the selftests only disable printing to
the main buffer as the selftests examine the main buffer to see if
it has what it expects, and prints can make the tests fail.
Make the selftests only disable printing to the toplevel buffer,
and leave the instance buffers alone"
* tag 'trace-v6.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Have function_graph selftest call cond_resched()
tracing: Only make selftest conditionals affect the global_trace
tracing: Make tracing_selftest_running/delete nops when not used
tracing: Have tracer selftests call cond_resched() before running
tracing: Move setting of tracing_selftest_running out of register_tracer()
tracing/selftests: Update synthetic event selftest to use common_stacktrace
tracing: Rename stacktrace field to common_stacktrace
tracing/histograms: Allow variables to have some modifiers
tracing/user_events: Document user_event_mm one-shot list usage
tracing/user_events: Rename link fields for clarity
tracing/user_events: Remove RCU lock while pinning pages
tracing/user_events: Split up mm alloc and attach
tracing/timerlat: Always wakeup the timerlat thread
tracing/user_events: Use long vs int for atomic bit ops
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fix from Herbert Xu:
"Fix an alignment crash in x86/aria"
* tag 'v6.4-p3' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: x86/aria - Use 16 byte alignment for GFNI constant vectors
|
|
This reverts commit 9828ed3f695a138f7add89fa2a186ababceb8006.
Sadly, it does seem to cause failures to load modules. Johan Hovold reports:
"This change breaks module loading during boot on the Lenovo Thinkpad
X13s (aarch64).
Specifically it results in indefinite probe deferral of the display
and USB (ethernet) which makes it a pain to debug. Typing in the dark
to acquire some logs reveals that other modules are missing as well"
Since this was applied late as a "let's try this", I'm reverting it
asap, and we can try to figure out what goes wrong later. The excessive
parallel module loading problem is annoying, but not noticeable in
normal situations, and this was only meant as an optimistic workaround
for a user-space bug.
One possible solution may be to do the optimistic exclusive open first,
and then use a lock to serialize loading if that fails.
Reported-by: Johan Hovold <johan@kernel.org>
Link: https://lore.kernel.org/lkml/ZHRpH-JXAxA6DnzR@hovoldconsulting.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When all kernel debugging is enabled (lockdep, KSAN, etc), the function
graph enabling and disabling can take several seconds to complete. The
function_graph selftest enables and disables function graph tracing
several times. With full debugging enabled, the soft lockup watchdog was
triggering because the selftest was running without ever scheduling.
Add cond_resched() throughout the test to make sure it does not trigger
the soft lockup detector.
Link: https://lkml.kernel.org/r/20230528051742.1325503-6-rostedt@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The tracing_selftest_running and tracing_selftest_disabled variables were
to keep trace_printk() and other writes from affecting the tracing
selftests, as the tracing selftests would examine the ring buffer to see
if it contained what it expected or not. trace_printk() and friends could
add to the ring buffer and cause the selftests to fail (and then disable
the tracer that was being tested). To keep that from happening, these
variables were added and would keep trace_printk() and friends from
writing to the ring buffer while the tests were going on.
But this was only the top level ring buffer (owned by the global_trace
instance). There is no reason to prevent writing into ring buffers of
other instances via the trace_array_printk() and friends. For the
functions that could be used by other instances, check if the global_trace
is the tracer instance that is being written to before deciding to not
allow the write.
Link: https://lkml.kernel.org/r/20230528051742.1325503-5-rostedt@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
There's no reason to test the condition variables tracing_selftest_running
or tracing_selftest_delete when tracing selftests are not enabled. Make
them define 0s when not the selftests are not configured in.
Link: https://lkml.kernel.org/r/20230528051742.1325503-4-rostedt@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
As there are more and more internal selftests being added to the Linux
kernel (KSAN, lockdep, etc) the selftests are taking longer to run when
these are enabled. Add a cond_resched() to the calling of
do_run_tracer_selftest() to force a schedule if NEED_RESCHED is set,
otherwise the soft lockup watchdog may trigger on boot up.
Link: https://lkml.kernel.org/r/20230528051742.1325503-3-rostedt@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The variables tracing_selftest_running and tracing_selftest_disabled are
only used for when CONFIG_FTRACE_STARTUP_TEST is enabled. Make them only
visible within the selftest code. The setting of those variables are in
the register_tracer() call, and set in a location where they do not need
to be. Create a wrapper around run_tracer_selftest() called
do_run_tracer_selftest() which sets those variables, and have
register_tracer() call that instead.
Having those variables only set within the CONFIG_FTRACE_STARTUP_TEST
scope gets rid of them (and also the ability to remove testing against
them) when the startup tests are not enabled (most cases).
Link: https://lkml.kernel.org/r/20230528051742.1325503-2-rostedt@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy
Pull phy fixes from Vinod Koul:
- init count imbalance fix in qcom-qmp-pcie and combo drivers
- kernel doc header fix for qcom-snps driver
- mediatek floating point comparison fix
- amlogic fix register value
* tag 'phy-fixes-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy:
phy: qcom-snps: correct struct qcom_snps_hsphy kerneldoc
phy: amlogic: phy-meson-g12a-mipi-dphy-analog: fix CNTL2_DIF_TX_CTL0 value
phy: mediatek: rework the floating point comparisons to fixed point
phy: qcom-qmp-pcie-msm8996: fix init-count imbalance
phy: qcom-qmp-combo: fix init-count imbalance
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine
Pull dmaengine fixes from Vinod Koul:
"Driver fixes for the at-hdmac, pl330, TI and IDXD drivers:
- AT HDMAC driver fixes for Flow Controller bitfield, peripheral ID
handling and potential NULL dereference check
- PL330 function rename to avoid conflicts
- build warning fix for pm function in TI driver
- IDXD driver fix for passing freed memory"
* tag 'dmaengine-fix-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine:
dmaengine: at_hdmac: Extend the Flow Controller bitfield to three bits
dmaengine: at_hdmac: Repair bitfield macros for peripheral ID handling
dmaengine: pl330: rename _start to prevent build error
dmaengine: at_xdmac: fix potential Oops in at_xdmac_prep_interleaved()
dmaengine: ti: k3-udma: annotate pm function with __maybe_unused
dmaengine: idxd: Fix passing freed memory in idxd_cdev_open()
|
|
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpu fix from Thomas Gleixner:
"A single fix for x86:
- Prevent a bogus setting for the number of HT siblings, which is
caused by the CPUID evaluation trainwreck of X86. That recomputes
the value for each CPU, so the last CPU "wins". That can cause
completely bogus sibling values"
* tag 'x86-urgent-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/topology: Fix erroneous smp_num_siblings on Intel Hybrid platforms
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
"A small set of perf fixes:
- Make the MSR-readout based CHA discovery work around broken
discovery tables in some SPR firmwares.
- Prevent saving PEBS configuration which has software bits set that
cause a crash when restored into the relevant MSR"
* tag 'perf-urgent-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/uncore: Correct the number of CHAs on SPR
perf/x86/intel: Save/restore cpuc->active_pebs_data_cfg when using guest PEBS
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull unwinder fixes from Thomas Gleixner:
"A set of unwinder and tooling fixes:
- Ensure that the stack pointer on x86 is aligned again so that the
unwinder does not read past the end of the stack
- Discard .note.gnu.property section which has a pointlessly
different alignment than the other note sections. That confuses
tooling of all sorts including readelf, libbpf and pahole"
* tag 'objtool-urgent-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/show_trace_log_lvl: Ensure stack pointer is aligned, again
vmlinux.lds.h: Discard .note.gnu.property section
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull debugobjects fixes from Thomas Gleixner:
"Two fixes for debugobjects:
- Prevent the allocation path from waking up kswapd.
That's a long standing issue due to the GFP_ATOMIC allocation flag.
As debug objects can be invoked from pretty much any context waking
kswapd can end up in arbitrary lock chains versus the waitqueue
lock
- Correct the explicit lockdep wait-type violation in
debug_object_fill_pool()"
* tag 'core-debugobjects-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
debugobjects: Don't wake up kswapd from fill_pool()
debugobjects,locking: Annotate debug_object_fill_pool() wait type violation
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
"A set of fixes for interrupt chip drivers:
- Prevent loss of state in the MIPS GIC interrupt controller
- Disable pseudo NMIs on Mediatek based Chromebooks as they have
firmware issues which cause instantenous chrashes and freezes wen
pseudo NMIs are used
- Fix the error handling path in the MBIGEN driver and a defined but
not used warning in the meson-gpio interrupt chip driver"
* tag 'irq-urgent-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/mbigen: Unify the error handling in mbigen_of_create_domain()
irqchip/meson-gpio: Mark OF related data as maybe unused
irqchip/mips-gic: Use raw spinlock for gic_lock
irqchip/mips-gic: Don't touch vl_map if a local interrupt is not routable
irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues
dt-bindings: interrupt-controller: arm,gic-v3: Add quirk for Mediatek SoCs w/ broken FW
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
Pull MIPS fixes from Thomas Bogendoerfer:
- fixes to get alchemy platform back in shape
- fix for initrd detection
* tag 'mips-fixes_6.4_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
mips: Move initrd_start check after initrd address sanitisation.
MIPS: Alchemy: fix dbdma2
MIPS: Restore Au1300 support
MIPS: unhide PATA_PLATFORM
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fix from Michael Ellerman:
- Reinstate ARCH_FORCE_MAX_ORDER ranges to fix various breakage
* tag 'powerpc-6.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/mm: Reinstate ARCH_FORCE_MAX_ORDER ranges
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
- a double free fix in the Xen pvcalls backend driver
- a fix for a regression causing the MSI related sysfs entries to not
being created in Xen PV guests
- a fix in the Xen blkfront driver for handling insane input data
better
* tag 'for-linus-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/pci/xen: populate MSI sysfs entries
xen/pvcalls-back: fix double frees with pvcalls_new_active_socket()
xen/blkfront: Only check REQ_FUA for writes
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc fixes from Greg KH:
"Here are some small driver fixes for 6.4-rc4. They are just two
different types:
- binder fixes and reverts for reported problems and regressions in
the binder "driver".
- coresight driver fixes for reported problems.
All of these have been in linux-next for over a week with no reported
problems"
* tag 'char-misc-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
binder: fix UAF of alloc->vma in race with munmap()
binder: add lockless binder_alloc_(set|get)_vma()
Revert "android: binder: stop saving a pointer to the VMA"
Revert "binder_alloc: add missing mmap_lock calls when using the VMA"
binder: fix UAF caused by faulty buffer cleanup
coresight: perf: Release Coresight path when alloc trace id failed
coresight: Fix signedness bug in tmc_etr_buf_insert_barrier_packet()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull compute express link fixes from Dan Williams:
"The 'media ready' series prevents the driver from acting on bad
capacity information, and it moves some checks earlier in the init
sequence which impacts topics in the queue for 6.5.
Additional hotplug testing uncovered a missing enable for memory
decode. A debug crash fix is also included.
Summary:
- Stop trusting capacity data before the "media ready" indication
- Add missing HDM decoder capability enable for the cold-plug case
- Fix a debug message induced crash"
* tag 'cxl-fixes-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
cxl: Explicitly initialize resources when media is not ready
cxl/port: Fix NULL pointer access in devm_cxl_add_port()
cxl: Move cxl_await_media_ready() to before capacity info retrieval
cxl: Wait Memory_Info_Valid before access memory related info
cxl/port: Enable the HDM decoder capability for switch ports
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC fixes from Arnd Bergmann:
"There have not been a lot of fixes for for the soc tree in 6.4, but
these have been sitting here for too long.
For the devicetree side, there is one minor warning fix for vexpress,
the rest all all for the the NXP i.MX platforms: SoC specific bugfixes
for the iMX8 clocks and its USB-3.0 gadget device, as well as board
specific fixes for regulators and the phy on some of the i.MX boards.
The microchip risc-v and arm32 maintainers now also add a shared
maintainer file entry for the arm64 parts.
The remaining fixes are all for firmware drivers, addressing mistakes
in the optee, scmi and ff-a firmware driver implementation, mostly in
the error handling code, incorrect use of the alloc_workqueue()
interface in SCMI, and compatibility with corner cases of the firmware
implementation"
* tag 'arm-fixes-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
MAINTAINERS: update arm64 Microchip entries
arm64: dts: imx8: fix USB 3.0 Gadget Failure in QM & QXPB0 at super speed
dt-binding: cdns,usb3: Fix cdns,on-chip-buff-size type
arm64: dts: colibri-imx8x: delete adc1 and dsp
arm64: dts: colibri-imx8x: fix iris pinctrl configuration
arm64: dts: colibri-imx8x: move pinctrl property from SoM to eval board
arm64: dts: colibri-imx8x: fix eval board pin configuration
arm64: dts: imx8mp: Fix video clock parents
ARM: dts: imx6qdl-mba6: Add missing pvcie-supply regulator
ARM: dts: imx6ull-dhcor: Set and limit the mode for PMIC buck 1, 2 and 3
arm64: dts: imx8mn-var-som: fix PHY detection bug by adding deassert delay
arm64: dts: imx8mn: Fix video clock parents
firmware: arm_ffa: Set reserved/MBZ fields to zero in the memory descriptors
firmware: arm_ffa: Fix FFA device names for logical partitions
firmware: arm_ffa: Fix usage of partition info get count flag
firmware: arm_ffa: Check if ffa_driver remove is present before executing
arm64: dts: arm: add missing cache properties
ARM: dts: vexpress: add missing cache properties
firmware: arm_scmi: Fix incorrect alloc_workqueue() invocation
optee: fix uninited async notif value
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull PCI fix from Bjorn Helgaas:
- Quirk Ice Lake Root Ports to work around DPC log size issue (Mika
Westerberg)
* tag 'pci-v6.4-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
PCI/DPC: Quirk PIO log size for Intel Ice Lake Root Ports
|
|
Pull VFIO fix from Alex Williamson:
- Test for and return error for invalid pfns through the pin pages
interface (Yan Zhao)
* tag 'vfio-v6.4-rc4' of https://github.com/awilliam/linux-vfio:
vfio/type1: check pfn valid before converting to struct page
|
|
Pull block fixes from Jens Axboe:
"A few fixes for the storage side of things:
- Fix bio caching condition for passthrough IO (Anuj)
- end-of-device check fix for zero sized devices (Christoph)
- Update Paolo's email address
- NVMe pull request via Keith with a single quirk addition
- Fix regression in how wbt enablement is done (Yu)
- Fix race in active queue accounting (Tian)"
* tag 'block-6.4-2023-05-26' of git://git.kernel.dk/linux:
NVMe: Add MAXIO 1602 to bogus nid list.
block: make bio_check_eod work for zero sized devices
block: fix bio-cache for passthru IO
block, bfq: update Paolo's address in maintainer list
blk-mq: fix race condition in active queue accounting
blk-wbt: fix that wbt can't be disabled by default
|
|
Pull io_uring fix from Jens Axboe:
"Just a single fix for the conditional schedule with the SQPOLL thread,
dropping the uring_lock if we do need to reschedule"
* tag 'io_uring-6.4-2023-05-26' of git://git.kernel.dk/linux:
io_uring: unlock sqd->lock before sq thread release CPU
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull thermal control fix from Rafael Wysocki:
"Fix a regression introduced inadvertently during the 6.3 cycle by a
commit making the Intel int340x thermal driver use sysfs_emit_at()
instead of scnprintf() (Srinivas Pandruvada)"
* tag 'thermal-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
thermal: intel: int340x: Add new line for UUID display
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"Fix three issues related to the ->fast_switch callback in the AMD
P-state cpufreq driver (Gautham R. Shenoy and Wyes Karny)"
* tag 'pm-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: amd-pstate: Update policy->cur in amd_pstate_adjust_perf()
cpufreq: amd-pstate: Remove fast_switch_possible flag from active driver
cpufreq: amd-pstate: Add ->fast_switch() callback
|
|
When media is not ready do not assume that the capacity information from
the identify command is valid, i.e. ->total_bytes
->partition_align_bytes ->{volatile,persistent}_only_bytes. Explicitly
zero out the capacity resources and exit early.
Given zero-init of those fields this patch is functionally equivalent to
the prior state, but it improves readability and robustness going
forward.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Link: https://lore.kernel.org/r/168506118166.3004974.13523455340007852589.stgit@djiang5-mobl3
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
- fix incorrect output in in-tree gpio tools
- fix a shell coding issue in gpio-sim selftests
- correctly set the permissions for debugfs attributes exposed by
gpio-mockup
- fix chip name and pin count in gpio-f7188x for one of the supported
models
- fix numberspace pollution when using dynamically and statically
allocated GPIOs together
* tag 'gpio-fixes-for-v6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpio-f7188x: fix chip name and pin count on Nuvoton chip
gpiolib: fix allocation of mixed dynamic/static GPIOs
gpio: mockup: Fix mode of debugfs files
selftests: gpio: gpio-sim: Fix BUG: test FAILED due to recent change
tools: gpio: fix debounce_period_us output of lsgpio
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- handle memory allocation error in checksumming helper (reported by
syzbot)
- fix lockdep splat when aborting a transaction, add NOFS protection
around invalidate_inode_pages2 that could allocate with GFP_KERNEL
- reduce chances to hit an ENOSPC during scrub with RAID56 profiles
* tag 'for-6.4-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: use nofs when cleaning up aborted transactions
btrfs: handle memory allocation failure in btrfs_csum_one_bio
btrfs: scrub: try harder to mark RAID56 block groups read-only
|
|
Pull drm fixes from Dave Airlie:
"This week's collection is pretty spread out, accel/qaic has a bunch of
fixes, amdgpu, then lots of single fixes across a bunch of places.
core:
- fix drmm_mutex_init lock class
mgag200:
- fix gamma lut initialisation
pl111:
- fix FB depth on IMPD-1 framebuffer
amdgpu:
- Fix missing BO unlocking in KIQ error path
- Avoid spurious secure display error messages
- SMU13 fix
- Fix an OD regression
- GPU reset display IRQ warning fix
- MST fix
radeon:
- Fix a DP regression
i915:
- PIPEDMC disabling fix for bigjoiner config
panel:
- fix aya neo air plus quirk
sched:
- remove redundant NULL check
qaic:
- fix NNC message corruption
- Grab ch_lock during QAIC_ATTACH_SLICE_BO
- Flush the transfer list again
- Validate if BO is sliced before slicing
- Validate user data before grabbing any lock
- initialize ret variable to 0
- silence some uninitialized variable warnings"
* tag 'drm-fixes-2023-05-26' of git://anongit.freedesktop.org/drm/drm:
drm/amd/display: Have Payload Properly Created After Resume
drm/amd/display: Fix warning in disabling vblank irq
drm/amd/pm: Fix output of pp_od_clk_voltage
drm/amd/pm: add missing NotifyPowerSource message mapping for SMU13.0.7
drm/radeon: reintroduce radeon_dp_work_func content
drm/amdgpu: don't enable secure display on incompatible platforms
drm:amd:amdgpu: Fix missing buffer object unlock in failure path
accel/qaic: Fix NNC message corruption
accel/qaic: Grab ch_lock during QAIC_ATTACH_SLICE_BO
accel/qaic: Flush the transfer list again
accel/qaic: Validate if BO is sliced before slicing
accel/qaic: Validate user data before grabbing any lock
accel/qaic: initialize ret variable to 0
drm/i915: Fix PIPEDMC disabling for a bigjoiner configuration
drm: fix drmm_mutex_init()
drm/sched: Remove redundant check
drm: panel-orientation-quirks: Change Air's quirk to support Air Plus
accel/qaic: silence some uninitialized variable warnings
drm/pl111: Fix FB depth on IMPD-1 framebuffer
drm/mgag200: Fix gamma lut not initialized.
|
|
I tried to streamline our user memory copy code fairly aggressively in
commit adfcf4231b8c ("x86: don't use REP_GOOD or ERMS for user memory
copies"), in order to then be able to clean up the code and inline the
modern FSRM case in commit 577e6a7fd50d ("x86: inline the 'rep movs' in
user copies for the FSRM case").
We had reports [1] of that causing regressions earlier with blogbench,
but that turned out to be a horrible benchmark for that case, and not a
sufficient reason for re-instating "rep movsb" on older machines.
However, now Eric Dumazet reported [2] a regression in performance that
seems to be a rather more real benchmark, where due to the removal of
"rep movs" a TCP stream over a 100Gbps network no longer reaches line
speed.
And it turns out that with the simplified the calling convention for the
non-FSRM case in commit 427fda2c8a49 ("x86: improve on the non-rep
'copy_user' function"), re-introducing the ERMS case is actually fairly
simple.
Of course, that "fairly simple" is glossing over several missteps due to
having to fight our assembler alternative code. This code really wanted
to rewrite a conditional branch to have two different targets, but that
made objtool sufficiently unhappy that this instead just ended up doing
a choice between "jump to the unrolled loop, or use 'rep movsb'
directly".
Let's see if somebody finds a case where the kernel memory copies also
care (see commit 68674f94ffc9: "x86: don't use REP_GOOD or ERMS for
small memory copies"). But Eric does argue that the user copies are
special because networking tries to copy up to 32KB at a time, if
order-3 pages allocations are possible.
In-kernel memory copies are typically small, unless they are the special
"copy pages at a time" kind that still use "rep movs".
Link: https://lore.kernel.org/lkml/202305041446.71d46724-yujie.liu@intel.com/ [1]
Link: https://lore.kernel.org/lkml/CANn89iKUbyrJ=r2+_kK+sb2ZSSHifFZ7QkPLDpAtkJ8v4WUumA@mail.gmail.com/ [2]
Reported-and-tested-by: Eric Dumazet <edumazet@google.com>
Fixes: adfcf4231b8c ("x86: don't use REP_GOOD or ERMS for user memory copies")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull NVMe fix from Keith:
"nvme fixes for 6.4
One nvme quirk (Tatsuki)"
* tag 'nvme-6.4-2023-05-26' of git://git.infradead.org/nvme:
NVMe: Add MAXIO 1602 to bogus nid list.
|
|
HIKSEMI FUTURE M.2 SSD uses the same dummy nguid and eui64.
I confirmed it with my two devices.
This patch marks the controller as NVME_QUIRK_BOGUS_NID.
---------------------------------------------------------
sugi@tempest:~% sudo nvme id-ctrl /dev/nvme0
NVME Identify Controller:
vid : 0x1e4b
ssvid : 0x1e4b
sn : 30096022612
mn : HS-SSD-FUTURE 2048G
fr : SN10542
rab : 0
ieee : 000000
cmic : 0
mdts : 7
cntlid : 0
ver : 0x10400
rtd3r : 0x7a120
rtd3e : 0x1e8480
oaes : 0x200
ctratt : 0x2
rrls : 0
cntrltype : 1
fguid : 00000000-0000-0000-0000-000000000000
<snip...>
---------------------------------------------------------
---------------------------------------------------------
sugi@tempest:~% sudo nvme id-ns /dev/nvme0n1
NVME Identify Namespace 1:
<snip...>
nguid : 00000000000000000000000000000000
eui64 : 0000000000000002
lbaf 0 : ms:0 lbads:9 rp:0 (in use)
---------------------------------------------------------
Signed-off-by: Tatsuki Sugiura <sugi@nemui.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux into arm/fixes
Arm FF-A fixes for v6.4
Quite a few fixes to address set of assorted issues:
1. NULL pointer dereference if the ffa driver doesn't provide remove()
callback as it is currently executed unconditionally
2. FF-A core probe failure on systems with v1.0 firmware as the new
partition info get count flag is used unconditionally
3. Failure to register more than one logical partition or service within
the same physical partition as the device name contains only VM ID
which will be same for all but each will have unique UUID.
4. Rejection of certain memory interface transmissions by the receivers
(secure partitions) as few MBZ fields are non-zero due to lack of
explicit re-initialization of those fields
* tag 'ffa-fixes-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux:
firmware: arm_ffa: Set reserved/MBZ fields to zero in the memory descriptors
firmware: arm_ffa: Fix FFA device names for logical partitions
firmware: arm_ffa: Fix usage of partition info get count flag
firmware: arm_ffa: Check if ffa_driver remove is present before executing
Link: https://lore.kernel.org/r/20230509143453.1188753-1-sudeep.holla@arm.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
git://anongit.freedesktop.org/drm/drm-misc into drm-fixes
drm-misc-fixes for v6.4-rc4:
- A few non-trivial fixes to qaic.
- Fix drmm_mutex_init always using same lock class.
- Fix pl111 fb depth.
- Fix uninitialised gamma lut in mgag200.
- Add Aya Neo Air Plus quirk.
- Trivial null check removal in scheduler.
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/d19f748c-2c5b-8140-5b05-a8282dfef73e@linux.intel.com
|
|
https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
amd-drm-fixes-6.4-2023-05-24:
amdgpu:
- Fix missing BO unlocking in KIQ error path
- Avoid spurious secure display error messages
- SMU13 fix
- Fix an OD regression
- GPU reset display IRQ warning fix
- MST fix
radeon:
- Fix a DP regression
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230524211238.7749-1-alexander.deucher@amd.com
|
|
git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
PIPEDMC disabling fix for bigjoiner config
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/ZG9aROGyc947/J1l@jlahtine-mobl.ger.corp.intel.com
|
|
Pull smb directory moves and client fixes from Steve French:
"Four smb3 client fixes (three of which marked for stable) and three
patches to move of fs/cifs and fs/ksmbd to a new common "fs/smb"
parent directory
- Move the client and server source directories to a common parent
directory:
fs/cifs -> fs/smb/client
fs/ksmbd -> fs/smb/server
fs/smbfs_common -> fs/smb/common
- important readahead fix
- important fix for SMB1 regression
- fix for missing mount option ("mapchars") in mount API conversion
- minor debugging improvement"
* tag '6.4-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb3: move Documentation/filesystems/cifs to Documentation/filesystems/smb
cifs: correct references in Documentation to old fs/cifs path
smb: move client and server files to common directory fs/smb
cifs: mapchars mount option ignored
smb3: display debug information better for encryption
cifs: fix smb1 mount regression
cifs: Fix cifs_limit_bvec_subset() to correctly check the maxmimum size
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc architecture fixes from Helge Deller:
"Quite a bunch of real bugfixes in here and most of them are tagged for
backporting: A fix for cache flushing from irq context, a kprobes &
kgdb breakpoint handling fix, and a fix in the alternative code
patching function to take care of CPU hotplugging.
parisc now provides LOCKDEP support and comes with a lightweight
spinlock check. Both features helped me to find the cache flush bug.
Additionally writing the AGP gatt has been fixed, the machine allows
the user to reboot after a system halt and arch_sync_dma_for_cpu() has
been optimized for PCXL PCUs.
Summary:
- Fix flush_dcache_page() for usage from irq context
- Handle kprobes breakpoints only in kernel context
- Handle kgdb breakpoints only in kernel context
- Use num_present_cpus() in alternative patching code
- Enable LOCKDEP support
- Add lightweight spinlock checks
- Flush AGP gatt writes and adjust gatt mask in parisc_agp_mask_memory()
- Allow to reboot machine after system halt
- Improve cache flushing for PCXL in arch_sync_dma_for_cpu()"
* tag 'parisc-for-6.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Fix flush_dcache_page() for usage from irq context
parisc: Handle kgdb breakpoints only in kernel context
parisc: Handle kprobes breakpoints only in kernel context
parisc: Allow to reboot machine after system halt
parisc: Enable LOCKDEP support
parisc: Add lightweight spinlock checks
parisc: Use num_present_cpus() in alternative patching code
parisc: Flush gatt writes and adjust gatt mask in parisc_agp_mask_memory()
parisc: Improve cache flushing for PCXL in arch_sync_dma_for_cpu()
|
|
It turns out that udev under certain circumstances will concurrently try
to load the same modules over-and-over excessively. This isn't a kernel
bug, but it ends up affecting the kernel, to the point that under
certain circumstances we can fail to boot, because the kernel uses a lot
of memory to read all the module data all at once.
Note that it isn't a memory leak, it's just basically a thundering herd
problem happening at bootup with a lot of CPUs, with the worst cases
then being pretty bad.
Admittedly the worst situations are somewhat contrived: lots and lots of
CPUs, not a lot of memory, and KASAN enabled to make it all slower and
as such (unintentionally) exacerbate the problem.
Luis explains: [1]
"My best assessment of the situation is that each CPU in udev ends up
triggering a load of duplicate set of modules, not just one, but *a
lot*. Not sure what heuristics udev uses to load a set of modules per
CPU."
Petr Pavlu chimes in: [2]
"My understanding is that udev workers are forked. An initial kmod
context is created by the main udevd process but no sharing happens
after the fork. It means that the mentioned memory pool logic doesn't
really kick in.
Multiple parallel load requests come from multiple udev workers, for
instance, each handling an udev event for one CPU device and making
the exactly same requests as all others are doing at the same time.
The optimization idea would be to recognize these duplicate requests
at the udevd/kmod level and converge them"
Note that module loading has tried to mitigate this issue before, see
for example commit 064f4536d139 ("module: avoid allocation if module is
already present and ready"), which has a few ASCII graphs on memory use
due to this same issue.
However, while that noticed that the module was already loaded, and
exited with an error early before spending any more time on setting up
the module, it didn't handle the case of multiple concurrent module
loads all being active - but not complete - at the same time.
Yes, one of them will eventually win the race and finalize its copy, and
the others will then notice that the module already exists and error
out, but while this all happens, we have tons of unnecessary concurrent
work being done.
Again, the real fix is for udev to not do that (maybe it should use
threads instead of fork, and have actual shared data structures and not
cause duplicate work). That real fix is apparently not trivial.
But it turns out that the kernel already has a pretty good model for
dealing with concurrent access to the same file: the i_writecount of the
inode.
In fact, the module loading already indirectly uses 'i_writecount' ,
because 'kernel_file_read()' will in fact do
ret = deny_write_access(file);
if (ret)
return ret;
...
allow_write_access(file);
around the read of the file data. We do not allow concurrent writes to
the file, and return -ETXTBUSY if the file was open for writing at the
same time as the module data is loaded from it.
And the solution to the reader concurrency problem is to simply extend
this "no concurrent writers" logic to simply be "exclusive access".
Note that "exclusive" in this context isn't really some absolute thing:
it's only exclusion from writers and from other "special readers" that
do this writer denial. So we simply introduce a variation of that
"deny_write_access()" logic that not only denies write access, but also
requires that this is the _only_ such access that denies write access.
Which means that you can't start loading a module that is already being
loaded as a module by somebody else, or you will get the same -ETXTBSY
error that you would get if there were writers around.
[ It also means that you can't try to load a currently executing
executable as a module, for the same reason: executables do that same
"deny_write_access()" thing, and that's obviously where the whole
ETXTBSY logic traditionally came from.
This is not a problem for kernel modules, since the set of normal
executable files and kernel module files is entirely disjoint. ]
This new function is called "exclusive_deny_write_access()", and the
implementation is trivial, in that it's just an atomic decrement of
i_writecount if it was 0 before.
To use that new exclusivity check, all we then do is wrap the module
loading with that exclusive_deny_write_access()() / allow_write_access()
pair. The actual patch is a bit bigger than that, because we want to
surround not just the "load file data" part, but the whole module setup,
to get maximum exclusion.
So this ends up splitting up "finit_module()" into a few helper
functions to make it all very clear and legible.
In Luis' test-case (bringing up 255 vcpu's in a virtual machine [3]),
the "wasted vmalloc" space (ie module data read into a vmalloc'ed area
in order to be loaded as a module, but then discarded because somebody
else loaded the same module instead) dropped from 1.8GiB to 474kB. Yes,
that's gigabytes to kilobytes.
It doesn't drop completely to zero, because even with this change, you
can still end up having completely serial pointless module loads, where
one udev process has loaded a module fully (and thus the kernel has
released that exclusive lock on the module file), and then another udev
process tries to load the same module again.
So while we cannot fully get rid of the fundamental bug in user space,
we _can_ get rid of the excessive concurrent thundering herd effect.
A couple of final side notes on this all:
- This tweak only affects the "finit_module()" system call, which gives
the kernel a file descriptor with the module data.
You can also just feed the module data as raw data from user space
with "init_module()" (note the lack of 'f' at the beginning), and
obviously for that case we do _not_ have any "exclusive read" logic.
So if you absolutely want to do things wrong in user space, and try
to load the same module multiple times, and error out only later when
the kernel ends up saying "you can't load the same module name
twice", you can still do that.
And in fact, some distros will do exactly that, because they will
uncompress the kernel module data in user space before feeding it to
the kernel (mainly because they haven't started using the new kernel
side decompression yet).
So this is not some absolute "you can't do concurrent loads of the
same module". It's literally just a very simple heuristic that will
catch it early in case you try to load the exact same module file at
the same time, and in that case avoid a potentially nasty situation.
- There is another user of "deny_write_access()": the verity code that
enables fs-verity on a file (the FS_IOC_ENABLE_VERITY ioctl).
If you use fs-verity and you care about verifying the kernel modules
(which does make sense), you should do it *before* loading said
kernel module. That may sound obvious, but now the implementation
basically requires it. Because if you try to do it concurrently, the
kernel may refuse to load the module file that is being set up by the
fs-verity code.
- This all will obviously mean that if you insist on loading the same
module in parallel, only one module load will succeed, and the others
will return with an error.
That was true before too, but what is different is that the -ETXTBSY
error can be returned *before* the success case of another process
fully loading and instantiating the module.
Again, that might sound obvious, and it is indeed the whole point of
the whole change: we are much quicker to notice the whole "you're
already in the process of loading this module".
So it's very much intentional, but it does mean that if you just
spray the kernel with "finit_module()", and expect that the module is
immediately loaded afterwards without checking the return value, you
are doing something horribly horribly wrong.
I'd like to say that that would never happen, but the whole _reason_
for this commit is that udev is currently doing something horribly
horribly wrong, so ...
Link: https://lore.kernel.org/all/ZEGopJ8VAYnE7LQ2@bombadil.infradead.org/ [1]
Link: https://lore.kernel.org/all/23bd0ce6-ef78-1cd8-1f21-0e706a00424a@suse.com/ [2]
Link: https://lore.kernel.org/lkml/ZG%2Fa+nrt4%2FAAUi5z@bombadil.infradead.org/ [3]
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Tested-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- During the acl rework we merged this cycle the generic_listxattr()
helper had to be modified in a way that in principle it would allow
for POSIX ACLs to be reported. At least that was the impression we
had initially. Because before the acl rework POSIX ACLs would be
reported if the filesystem did have POSIX ACL xattr handlers in
sb->s_xattr. That logic changed and now we can simply check whether
the superblock has SB_POSIXACL set and if the inode has
inode->i_{default_}acl set report the appropriate POSIX ACL name.
However, we didn't realize that generic_listxattr() was only ever
used by two filesystems. Both of them don't support POSIX ACLs via
sb->s_xattr handlers and so never reported POSIX ACLs via
generic_listxattr() even if they raised SB_POSIXACL and did contain
inodes which had acls set. The example here is nfs4.
As a result, generic_listxattr() suddenly started reporting POSIX
ACLs when it wouldn't have before. Since SB_POSIXACL implies that the
umask isn't stripped in the VFS nfs4 can't just drop SB_POSIXACL from
the superblock as it would also alter umask handling for them.
So just have generic_listxattr() not report POSIX ACLs as it never
did anyway. It's documented as such.
- Our SB_* flags currently use a signed integer and we shift the last
bit causing UBSAN to complain about undefined behavior. Switch to
using unsigned. While the original patch used an explicit unsigned
bitshift it's now pretty common to rely on the BIT() macro in a lot
of headers nowadays. So the patch has been adjusted to use that.
- Add Namjae as ntfs reviewer. They're already active this cycle so
let's make it explicit right now.
* tag 'vfs/v6.4-rc3/misc.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
ntfs: Add myself as a reviewer
fs: don't call posix_acl_listxattr in generic_listxattr
fs: fix undefined behavior in bit shift for SB_NOUSER
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from bluetooth and bpf.
Current release - regressions:
- net: fix skb leak in __skb_tstamp_tx()
- eth: mtk_eth_soc: fix QoS on DSA MAC on non MTK_NETSYS_V2 SoCs
Current release - new code bugs:
- handshake:
- fix sock->file allocation
- fix handshake_dup() ref counting
- bluetooth:
- fix potential double free caused by hci_conn_unlink
- fix UAF in hci_conn_hash_flush
Previous releases - regressions:
- core: fix stack overflow when LRO is disabled for virtual
interfaces
- tls: fix strparser rx issues
- bpf:
- fix many sockmap/TCP related issues
- fix a memory leak in the LRU and LRU_PERCPU hash maps
- init the offload table earlier
- eth: mlx5e:
- do as little as possible in napi poll when budget is 0
- fix using eswitch mapping in nic mode
- fix deadlock in tc route query code
Previous releases - always broken:
- udplite: fix NULL pointer dereference in __sk_mem_raise_allocated()
- raw: fix output xfrm lookup wrt protocol
- smc: reset connection when trying to use SMCRv2 fails
- phy: mscc: enable VSC8501/2 RGMII RX clock
- eth: octeontx2-pf: fix TSOv6 offload
- eth: cdc_ncm: deal with too low values of dwNtbOutMaxSize"
* tag 'net-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (79 commits)
udplite: Fix NULL pointer dereference in __sk_mem_raise_allocated().
net: phy: mscc: enable VSC8501/2 RGMII RX clock
net: phy: mscc: remove unnecessary phydev locking
net: phy: mscc: add support for VSC8501
net: phy: mscc: add VSC8502 to MODULE_DEVICE_TABLE
net/handshake: Enable the SNI extension to work properly
net/handshake: Unpin sock->file if a handshake is cancelled
net/handshake: handshake_genl_notify() shouldn't ignore @flags
net/handshake: Fix uninitialized local variable
net/handshake: Fix handshake_dup() ref counting
net/handshake: Remove unneeded check from handshake_dup()
ipv6: Fix out-of-bounds access in ipv6_find_tlv()
net: ethernet: mtk_eth_soc: fix QoS on DSA MAC on non MTK_NETSYS_V2 SoCs
docs: netdev: document the existence of the mail bot
net: fix skb leak in __skb_tstamp_tx()
r8169: Use a raw_spinlock_t for the register locks.
page_pool: fix inconsistency for page_pool_ring_[un]lock()
bpf, sockmap: Test progs verifier error with latest clang
bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer with drops
bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer
...
|
|
Traditionally, all CPUs in a system have identical numbers of SMT
siblings. That changes with hybrid processors where some logical CPUs
have a sibling and others have none.
Today, the CPU boot code sets the global variable smp_num_siblings when
every CPU thread is brought up. The last thread to boot will overwrite
it with the number of siblings of *that* thread. That last thread to
boot will "win". If the thread is a Pcore, smp_num_siblings == 2. If it
is an Ecore, smp_num_siblings == 1.
smp_num_siblings describes if the *system* supports SMT. It should
specify the maximum number of SMT threads among all cores.
Ensure that smp_num_siblings represents the system-wide maximum number
of siblings by always increasing its value. Never allow it to decrease.
On MeteorLake-P platform, this fixes a problem that the Ecore CPUs are
not updated in any cpu sibling map because the system is treated as an
UP system when probing Ecore CPUs.
Below shows part of the CPU topology information before and after the
fix, for both Pcore and Ecore CPU (cpu0 is Pcore, cpu 12 is Ecore).
...
-/sys/devices/system/cpu/cpu0/topology/package_cpus:000fff
-/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-11
+/sys/devices/system/cpu/cpu0/topology/package_cpus:3fffff
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-21
...
-/sys/devices/system/cpu/cpu12/topology/package_cpus:001000
-/sys/devices/system/cpu/cpu12/topology/package_cpus_list:12
+/sys/devices/system/cpu/cpu12/topology/package_cpus:3fffff
+/sys/devices/system/cpu/cpu12/topology/package_cpus_list:0-21
Notice that the "before" 'package_cpus_list' has only one CPU. This
means that userspace tools like lscpu will see a little laptop like
an 11-socket system:
-Core(s) per socket: 1
-Socket(s): 11
+Core(s) per socket: 16
+Socket(s): 1
This is also expected to make the scheduler do rather wonky things
too.
[ dhansen: remove CPUID detail from changelog, add end user effects ]
CC: stable@kernel.org
Fixes: bbb65d2d365e ("x86: use cpuid vector 0xb when available for detecting cpu topology")
Fixes: 95f3d39ccf7a ("x86/cpu/topology: Provide detect_extended_topology_early()")
Suggested-by: Len Brown <len.brown@intel.com>
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20230323015640.27906-1-rui.zhang%40intel.com
|