Age | Commit message (Collapse) | Author |
|
n->output field can be read locklessly, while a writer
might change the pointer concurrently.
Add missing annotations to prevent load-store tearing.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Eric Dumazet says:
====================
net: use DEV_STATS_xxx() helpers in virtio_net and l2tp_eth
Inspired by another (minor) KCSAN syzbot report.
Both virtio_net and l2tp_eth can use DEV_STATS_xxx() helpers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Core networking has opt-in atomic variant of dev->stats,
simply use DEV_STATS_INC(), DEV_STATS_ADD() and DEV_STATS_READ().
v2: removed @priv local var in l2tp_eth_dev_recv() (Simon)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Use DEV_STATS_INC() and DEV_STATS_READ() which provide
atomicity on paths that can be used concurrently.
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Companion of DEV_STATS_INC() & DEV_STATS_ADD().
This is going to be used in the series.
Use it in macsec_get_stats64().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch extends flower offload support for MPLS protocol.
Due to hardware limitation, currently driver supports lse
depth up to 4.
Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
While looking at a related syzbot report involving neigh_periodic_work(),
I found that I forgot to add an annotation when deleting an
RCU protected item from a list.
Readers use rcu_deference(*np), we need to use either
rcu_assign_pointer() or WRITE_ONCE() on writer side
to prevent store tearing.
I use rcu_assign_pointer() to have lockdep support,
this was the choice made in neigh_flush_dev().
Fixes: 767e97e1e0db ("neigh: RCU conversion of struct neighbour")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Since commit d8131c2965d5 ("kbuild: remove $(MODLIB)/source symlink"),
modules_install does not create the 'source' symlink.
Remove the stale code from builddeb and kernel.spec.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
bluetooth pull request for net:
- Fix handling of HCI_QUIRK_STRICT_DUPLICATE_FILTER
- Fix handling of listen for ISO unicast
- Fix build warnings
- Fix leaking content of local_codecs
- Add shutdown function for QCA6174
- Delete unused hci_req_prepare_suspend() declaration
- Fix hci_link_tx_to RCU lock usage
- Avoid redundant authentication
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
During review of the patch that became 2e0ec0afa902 ("net: ethernet:
xilinx: Convert to platform remove callback returning void") in
net-next, Radhey Shyam Pandey pointed out that the change makes the
documentation about the return value obsolete. The patch was applied
without addressing this feedback, so here comes a fix in a separate
patch.
Fixes: 2e0ec0afa902 ("net: ethernet: xilinx: Convert to platform remove callback returning void")
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Reviewed-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The second parameter of stmmac_pltfr_init() needs the pointer of
"struct plat_stmmacenet_data". So, correct the parameter typo when calling the
function.
Otherwise, it may cause this alignment exception when doing suspend/resume.
[ 49.067201] CPU1 is up
[ 49.135258] Internal error: SP/PC alignment exception: 000000008a000000 [#1] PREEMPT SMP
[ 49.143346] Modules linked in: soc_imx9 crct10dif_ce polyval_ce nvmem_imx_ocotp_fsb_s400 polyval_generic layerscape_edac_mod snd_soc_fsl_asoc_card snd_soc_imx_audmux snd_soc_imx_card snd_soc_wm8962 el_enclave snd_soc_fsl_micfil rtc_pcf2127 rtc_pcf2131 flexcan can_dev snd_soc_fsl_xcvr snd_soc_fsl_sai imx8_media_dev(C) snd_soc_fsl_utils fuse
[ 49.173393] CPU: 0 PID: 565 Comm: sh Tainted: G C 6.5.0-rc4-next-20230804-05047-g5781a6249dae #677
[ 49.183721] Hardware name: NXP i.MX93 11X11 EVK board (DT)
[ 49.189190] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 49.196140] pc : 0x80800052
[ 49.198931] lr : stmmac_pltfr_resume+0x34/0x50
[ 49.203368] sp : ffff800082f8bab0
[ 49.206670] x29: ffff800082f8bab0 x28: ffff0000047d0ec0 x27: ffff80008186c170
[ 49.213794] x26: 0000000b5e4ff1ba x25: ffff800081e5fa74 x24: 0000000000000010
[ 49.220918] x23: ffff800081fe0000 x22: 0000000000000000 x21: 0000000000000000
[ 49.228042] x20: ffff0000001b4010 x19: ffff0000001b4010 x18: 0000000000000006
[ 49.235166] x17: ffff7ffffe007000 x16: ffff800080000000 x15: 0000000000000000
[ 49.242290] x14: 00000000000000fc x13: 0000000000000000 x12: 0000000000000000
[ 49.249414] x11: 0000000000000001 x10: 0000000000000a60 x9 : ffff800082f8b8c0
[ 49.256538] x8 : 0000000000000008 x7 : 0000000000000001 x6 : 000000005f54a200
[ 49.263662] x5 : 0000000001000000 x4 : ffff800081b93680 x3 : ffff800081519be0
[ 49.270786] x2 : 0000000080800052 x1 : 0000000000000000 x0 : ffff0000001b4000
[ 49.277911] Call trace:
[ 49.280346] 0x80800052
[ 49.282781] platform_pm_resume+0x2c/0x68
[ 49.286785] dpm_run_callback.constprop.0+0x74/0x134
[ 49.291742] device_resume+0x88/0x194
[ 49.295391] dpm_resume+0x10c/0x230
[ 49.298866] dpm_resume_end+0x18/0x30
[ 49.302515] suspend_devices_and_enter+0x2b8/0x624
[ 49.307299] pm_suspend+0x1fc/0x348
[ 49.310774] state_store+0x80/0x104
[ 49.314258] kobj_attr_store+0x18/0x2c
[ 49.318002] sysfs_kf_write+0x44/0x54
[ 49.321659] kernfs_fop_write_iter+0x120/0x1ec
[ 49.326088] vfs_write+0x1bc/0x300
[ 49.329485] ksys_write+0x70/0x104
[ 49.332874] __arm64_sys_write+0x1c/0x28
[ 49.336783] invoke_syscall+0x48/0x114
[ 49.340527] el0_svc_common.constprop.0+0xc4/0xe4
[ 49.345224] do_el0_svc+0x38/0x98
[ 49.348526] el0_svc+0x2c/0x84
[ 49.351568] el0t_64_sync_handler+0x100/0x12c
[ 49.355910] el0t_64_sync+0x190/0x194
[ 49.359567] Code: ???????? ???????? ???????? ???????? (????????)
[ 49.365644] ---[ end trace 0000000000000000 ]---
Fixes: 97117eb51ec8 ("net: stmmac: platform: provide stmmac_pltfr_init()")
Signed-off-by: Clark Wang <xiaoning.wang@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Switch to napi_consume_skb() to take advantage of bulk free, and skb
reuse through skb cache in conjunction with napi_build_skb().
When parameter 'budget' = 0, indicating non-NAPI context,
dev_consume_skb_any() is called internally.
Signed-off-by: Sieng-Piaw Liew <liew.s.piaw@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Eric Dumazet says:
====================
net_sched: sch_fq: round of improvements
For FQ tenth anniversary, it was time for making it faster.
The FQ part (as in Fair Queue) is rather expensive, because
we have to classify packets and store them in a per-flow structure,
and add this per-flow structure in a hash table. Then the RR lists
also add cache line misses.
Most fq qdisc are almost idle. Trying to share NIC bandwidth has
no benefits, thus the qdisc could behave like a FIFO.
This series brings a 5 % throughput increase in intensive
tcp_rr workload, and 13 % increase for (unpaced) UDP packets.
v2: removed an extra label (build bot).
Fix an accidental increase of stat_internal_packets counter
in fast path.
Added "constify qdisc_priv()" patch to allow fq_fastpath_check()
first parameter to be const.
typo on 'eligible' (Willem)
====================
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
FQ performs garbage collection at enqueue time, and only
if number of flows is above a given threshold, which
is hit after the qdisc has been used a bit.
Since an RB-tree traversal is needed to locate a flow,
it makes sense to perform gc all the time, to keep
rb-trees smaller.
This reduces by 50 % average storage costs in FQ,
and avoids 1 cache line miss at enqueue time when
fast path added in prior patch can not be used.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
TCQ_F_CAN_BYPASS can be used by few qdiscs.
Idea is that if we queue a packet to an empty qdisc,
following dequeue() would pick it immediately.
FQ can not use the generic TCQ_F_CAN_BYPASS code,
because some additional checks need to be performed.
This patch adds a similar fast path to FQ.
Most of the time, qdisc is not throttled,
and many packets can avoid bringing/touching
at least four cache lines, and consuming 128bytes
of memory to store the state of a flow.
After this patch, netperf can send UDP packets about 13 % faster,
and pktgen goes 30 % faster (when FQ is in the way), on a fast NIC.
TCP traffic is also improved, thanks to a reduction of cache line misses.
I have measured a 5 % increase of throughput on a tcp_rr intensive workload.
tc -s -d qd sh dev eth1
...
qdisc fq 8004: parent 1:2 limit 10000p flow_limit 100p buckets 1024
orphan_mask 1023 quantum 3028b initial_quantum 15140b low_rate_threshold 550Kbit
refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
Sent 5646784384 bytes 1985161 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
flows 122 (inactive 122 throttled 0)
gc 0 highprio 0 fastpath 659990 throttled 27762 latency 8.57us
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently, when one fq qdisc has no more packets to send, it can still
have some flows stored in its RR lists (q->new_flows & q->old_flows)
This was a design choice, but what is a bit disturbing is that
the inactive_flows counter does not include the count of empty flows
in RR lists.
As next patch needs to know better if there are active flows,
this change makes inactive_flows exact.
Before the patch, following command on an empty qdisc could have returned:
lpaa17:~# tc -s -d qd sh dev eth1 | grep inactive
flows 1322 (inactive 1316 throttled 0)
flows 1330 (inactive 1325 throttled 0)
flows 1193 (inactive 1190 throttled 0)
flows 1208 (inactive 1202 throttled 0)
After the patch, we now have:
lpaa17:~# tc -s -d qd sh dev eth1 | grep inactive
flows 1322 (inactive 1322 throttled 0)
flows 1330 (inactive 1330 throttled 0)
flows 1193 (inactive 1193 throttled 0)
flows 1208 (inactive 1208 throttled 0)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
q->flows can be often modified, and q->timer_slack is read mostly.
Exchange the two fields, so that cache line countaining
quantum, initial_quantum, and other critical parameters
stay clean (read-mostly).
Move q->watchdog next to q->stat_throttled
Add comments explaining how the structure is split in
three different parts.
pahole output before the patch:
struct fq_sched_data {
struct fq_flow_head new_flows; /* 0 0x10 */
struct fq_flow_head old_flows; /* 0x10 0x10 */
struct rb_root delayed; /* 0x20 0x8 */
u64 time_next_delayed_flow; /* 0x28 0x8 */
u64 ktime_cache; /* 0x30 0x8 */
unsigned long unthrottle_latency_ns; /* 0x38 0x8 */
/* --- cacheline 1 boundary (64 bytes) --- */
struct fq_flow internal __attribute__((__aligned__(64))); /* 0x40 0x80 */
/* XXX last struct has 16 bytes of padding */
/* --- cacheline 3 boundary (192 bytes) --- */
u32 quantum; /* 0xc0 0x4 */
u32 initial_quantum; /* 0xc4 0x4 */
u32 flow_refill_delay; /* 0xc8 0x4 */
u32 flow_plimit; /* 0xcc 0x4 */
unsigned long flow_max_rate; /* 0xd0 0x8 */
u64 ce_threshold; /* 0xd8 0x8 */
u64 horizon; /* 0xe0 0x8 */
u32 orphan_mask; /* 0xe8 0x4 */
u32 low_rate_threshold; /* 0xec 0x4 */
struct rb_root * fq_root; /* 0xf0 0x8 */
u8 rate_enable; /* 0xf8 0x1 */
u8 fq_trees_log; /* 0xf9 0x1 */
u8 horizon_drop; /* 0xfa 0x1 */
/* XXX 1 byte hole, try to pack */
<bad> u32 flows; /* 0xfc 0x4 */
/* --- cacheline 4 boundary (256 bytes) --- */
u32 inactive_flows; /* 0x100 0x4 */
u32 throttled_flows; /* 0x104 0x4 */
u64 stat_gc_flows; /* 0x108 0x8 */
u64 stat_internal_packets; /* 0x110 0x8 */
u64 stat_throttled; /* 0x118 0x8 */
u64 stat_ce_mark; /* 0x120 0x8 */
u64 stat_horizon_drops; /* 0x128 0x8 */
u64 stat_horizon_caps; /* 0x130 0x8 */
u64 stat_flows_plimit; /* 0x138 0x8 */
/* --- cacheline 5 boundary (320 bytes) --- */
u64 stat_pkts_too_long; /* 0x140 0x8 */
u64 stat_allocation_errors; /* 0x148 0x8 */
<bad> u32 timer_slack; /* 0x150 0x4 */
/* XXX 4 bytes hole, try to pack */
struct qdisc_watchdog watchdog; /* 0x158 0x48 */
/* size: 448, cachelines: 7, members: 34 */
/* sum members: 411, holes: 2, sum holes: 5 */
/* padding: 32 */
/* paddings: 1, sum paddings: 16 */
/* forced alignments: 1 */
};
pahole output after the patch:
struct fq_sched_data {
struct fq_flow_head new_flows; /* 0 0x10 */
struct fq_flow_head old_flows; /* 0x10 0x10 */
struct rb_root delayed; /* 0x20 0x8 */
u64 time_next_delayed_flow; /* 0x28 0x8 */
u64 ktime_cache; /* 0x30 0x8 */
unsigned long unthrottle_latency_ns; /* 0x38 0x8 */
/* --- cacheline 1 boundary (64 bytes) --- */
struct fq_flow internal __attribute__((__aligned__(64))); /* 0x40 0x80 */
/* XXX last struct has 16 bytes of padding */
/* --- cacheline 3 boundary (192 bytes) --- */
u32 quantum; /* 0xc0 0x4 */
u32 initial_quantum; /* 0xc4 0x4 */
u32 flow_refill_delay; /* 0xc8 0x4 */
u32 flow_plimit; /* 0xcc 0x4 */
unsigned long flow_max_rate; /* 0xd0 0x8 */
u64 ce_threshold; /* 0xd8 0x8 */
u64 horizon; /* 0xe0 0x8 */
u32 orphan_mask; /* 0xe8 0x4 */
u32 low_rate_threshold; /* 0xec 0x4 */
struct rb_root * fq_root; /* 0xf0 0x8 */
u8 rate_enable; /* 0xf8 0x1 */
u8 fq_trees_log; /* 0xf9 0x1 */
u8 horizon_drop; /* 0xfa 0x1 */
/* XXX 1 byte hole, try to pack */
<good> u32 timer_slack; /* 0xfc 0x4 */
/* --- cacheline 4 boundary (256 bytes) --- */
<good> u32 flows; /* 0x100 0x4 */
u32 inactive_flows; /* 0x104 0x4 */
u32 throttled_flows; /* 0x108 0x4 */
/* XXX 4 bytes hole, try to pack */
u64 stat_throttled; /* 0x110 0x8 */
<better> struct qdisc_watchdog watchdog; /* 0x118 0x48 */
/* --- cacheline 5 boundary (320 bytes) was 32 bytes ago --- */
u64 stat_gc_flows; /* 0x160 0x8 */
u64 stat_internal_packets; /* 0x168 0x8 */
u64 stat_ce_mark; /* 0x170 0x8 */
u64 stat_horizon_drops; /* 0x178 0x8 */
/* --- cacheline 6 boundary (384 bytes) --- */
u64 stat_horizon_caps; /* 0x180 0x8 */
u64 stat_flows_plimit; /* 0x188 0x8 */
u64 stat_pkts_too_long; /* 0x190 0x8 */
u64 stat_allocation_errors; /* 0x198 0x8 */
/* Force padding: */
u64 :64;
u64 :64;
u64 :64;
u64 :64;
/* size: 448, cachelines: 7, members: 34 */
/* sum members: 411, holes: 2, sum holes: 5 */
/* padding: 32 */
/* paddings: 1, sum paddings: 16 */
/* forced alignments: 1 */
};
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In order to propagate const qualifiers, we change qdisc_priv()
to accept a possibly const argument.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Eric Dumazet says:
====================
tcp: add tcp_delack_max()
First patches are adding const qualifiers to four existing helpers.
Third patch adds a much needed companion feature to RTAX_RTO_MIN.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
While BPF allows to set icsk->->icsk_delack_max
and/or icsk->icsk_rto_min, we have an ip route
attribute (RTAX_RTO_MIN) to be able to tune rto_min,
but nothing to consequently adjust max delayed ack,
which vary from 40ms to 200 ms (TCP_DELACK_{MIN|MAX}).
This makes RTAX_RTO_MIN of almost no practical use,
unless customers are in big trouble.
Modern days datacenter communications want to set
rto_min to ~5 ms, and the max delayed ack one jiffie
smaller to avoid spurious retransmits.
After this patch, an "rto_min 5" route attribute will
effectively lower max delayed ack timers to 4 ms.
Note in the following ss output, "rto:6 ... ato:4"
$ ss -temoi dst XXXXXX
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
ESTAB 0 0 [2002:a05:6608:295::]:52950 [2002:a05:6608:297::]:41597
ino:255134 sk:1001 <->
skmem:(r0,rb1707063,t872,tb262144,f0,w0,o0,bl0,d0) ts sack
cubic wscale:8,8 rto:6 rtt:0.02/0.002 ato:4 mss:4096 pmtu:4500
rcvmss:536 advmss:4096 cwnd:10 bytes_sent:54823160 bytes_acked:54823121
bytes_received:54823120 segs_out:1370582 segs_in:1370580
data_segs_out:1370579 data_segs_in:1370578 send 16.4Gbps
pacing_rate 32.6Gbps delivery_rate 1.72Gbps delivered:1370579
busy:26920ms unacked:1 rcv_rtt:34.615 rcv_space:65920
rcv_ssthresh:65535 minrtt:0.015 snd_wnd:65536
While we could argue this patch fixes a bug with RTAX_RTO_MIN,
I do not add a Fixes: tag, so that we can soak it a bit before
asking backports to stable branches.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Make clear these functions do not change any field from TCP socket.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Both helpers only read fields from their socket argument.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Drivers must not reference functions marked with __exit as these likely
are not available when the code is built-in.
There are few creative offenders uncovered for example in ARCH=amd64
allmodconfig builds. So only trigger the section mismatch warning for
W=1 builds.
The dual rule that drivers must not reference .init.* is implemented
since commit 0db252452378 ("modpost: don't allow *driver to reference
.init.*") which however missed that .exit.* should be handled in the
same way.
Thanks to Masahiro Yamada and Arnd Bergmann who gave valuable hints to
find this improvement.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
|
|
Remove the left-over of commit e24f6628811e ("modpost: remove all
traces of cpuinit/cpuexit sections").
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
Without this 'else' statement, an "usb" name goes into two handlers:
the first/previous 'if' statement _AND_ the for-loop over 'devtable',
but the latter is useless as it has no 'usb' device_id entry anyway.
Tested with allmodconfig before/after patch; no changes to *.mod.c:
git checkout v6.6-rc3
make -j$(nproc) allmodconfig
make -j$(nproc) olddefconfig
make -j$(nproc)
find . -name '*.mod.c' | cpio -pd /tmp/before
# apply patch
make -j$(nproc)
find . -name '*.mod.c' | cpio -pd /tmp/after
diff -r /tmp/before/ /tmp/after/
# no difference
Fixes: acbef7b76629 ("modpost: fix module autoloading for OF devices with generic compatible property")
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC fixes from Arnd Bergmann:
"These are the latest bug fixes that have come up in the soc tree. Most
of these are fairly minor. Most notably, the majority of changes this
time are not for dts files as usual.
- Updates to the addresses of the broadcom and aspeed entries in the
MAINTAINERS file.
- Defconfig updates to address a regression on samsung and a build
warning from an unknown Kconfig symbol
- Build fixes for the StrongARM and Uniphier platforms
- Code fixes for SCMI and FF-A firmware drivers, both of which had a
simple bug that resulted in invalid data, and a lesser fix for the
optee firmware driver
- Multiple fixes for the recently added loongson/loongarch "guts" soc
driver
- Devicetree fixes for RISC-V on the startfive platform, addressing
issues with NOR flash, usb and uart.
- Multiple fixes for NXP i.MX8/i.MX9 dts files, fixing problems with
clock, gpio, hdmi settings and the Makefile
- Bug fixes for i.MX firmware code and the OCOTP soc driver
- Multiple fixes for the TI sysc bus driver
- Minor dts updates for TI omap dts files, to address boot time
warnings and errors"
* tag 'soc-fixes-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (35 commits)
MAINTAINERS: Fix Florian Fainelli's email address
arm64: defconfig: enable syscon-poweroff driver
ARM: locomo: fix locomolcd_power declaration
soc: loongson: loongson2_guts: Remove unneeded semicolon
soc: loongson: loongson2_guts: Convert to devm_platform_ioremap_resource()
soc: loongson: loongson_pm2: Populate children syscon nodes
dt-bindings: soc: loongson,ls2k-pmc: Allow syscon-reboot/syscon-poweroff as child
soc: loongson: loongson_pm2: Drop useless of_device_id compatible
dt-bindings: soc: loongson,ls2k-pmc: Use fallbacks for ls2k-pmc compatible
soc: loongson: loongson_pm2: Add dependency for INPUT
arm64: defconfig: remove CONFIG_COMMON_CLK_NPCM8XX=y
ARM: uniphier: fix cache kernel-doc warnings
MAINTAINERS: aspeed: Update Andrew's email address
MAINTAINERS: aspeed: Update git tree URL
firmware: arm_ffa: Don't set the memory region attributes for MEM_LEND
arm64: dts: imx: Add imx8mm-prt8mm.dtb to build
arm64: dts: imx8mm-evk: Fix hdmi@3d node
soc: imx8m: Enable OCOTP clock for imx8mm before reading registers
arm64: dts: imx8mp-beacon-kit: Fix audio_pll2 clock
arm64: dts: imx8mp: Fix SDMA2/3 clocks
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Make sure 32-bit applications using user events have aligned access
when running on a 64-bit kernel.
- Add cond_resched in the loop that handles converting enums in
print_fmt string is trace events.
- Fix premature wake ups of polling processes in the tracing ring
buffer. When a task polls waiting for a percentage of the ring buffer
to be filled, the writer still will wake it up at every event. Add
the polling's percentage to the "shortest_full" list to tell the
writer when to wake it up.
- For eventfs dir lookups on dynamic events, an event system's only
event could be removed, leaving its dentry with no children. This is
totally legitimate. But in eventfs_release() it must not access the
children array, as it is only allocated when the dentry has children.
* tag 'trace-v6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
eventfs: Test for dentries array allocated in eventfs_release()
tracing/user_events: Align set_bit() address for all archs
tracing: relax trace_event_eval_update() execution with cond_resched()
ring-buffer: Update "shortest_full" in polling
|
|
The dcache_dir_open_wrapper() could be called when a dynamic event is
being deleted leaving a dentry with no children. In this case the
dlist->dentries array will never be allocated. This needs to be checked
for in eventfs_release(), otherwise it will trigger a NULL pointer
dereference.
Link: https://lore.kernel.org/linux-trace-kernel/20230930090106.1c3164e9@rorschach.local.home
Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Fixes: ef36b4f92868 ("eventfs: Remember what dentries were created on dir open")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
All architectures should use a long aligned address passed to set_bit().
User processes can pass either a 32-bit or 64-bit sized value to be
updated when tracing is enabled when on a 64-bit kernel. Both cases are
ensured to be naturally aligned, however, that is not enough. The
address must be long aligned without affecting checks on the value
within the user process which require different adjustments for the bit
for little and big endian CPUs.
Add a compat flag to user_event_enabler that indicates when a 32-bit
value is being used on a 64-bit kernel. Long align addresses and correct
the bit to be used by set_bit() to account for this alignment. Ensure
compat flags are copied during forks and used during deletion clears.
Link: https://lore.kernel.org/linux-trace-kernel/20230925230829.341-2-beaub@linux.microsoft.com
Link: https://lore.kernel.org/linux-trace-kernel/20230914131102.179100-1-cleger@rivosinc.com/
Cc: stable@vger.kernel.org
Fixes: 7235759084a4 ("tracing/user_events: Use remote writes for event enablement")
Reported-by: Clément Léger <cleger@rivosinc.com>
Suggested-by: Clément Léger <cleger@rivosinc.com>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
When kernel is compiled without preemption, the eval_map_work_func()
(which calls trace_event_eval_update()) will not be preempted up to its
complete execution. This can actually cause a problem since if another
CPU call stop_machine(), the call will have to wait for the
eval_map_work_func() function to finish executing in the workqueue
before being able to be scheduled. This problem was observe on a SMP
system at boot time, when the CPU calling the initcalls executed
clocksource_done_booting() which in the end calls stop_machine(). We
observed a 1 second delay because one CPU was executing
eval_map_work_func() and was not preempted by the stop_machine() task.
Adding a call to cond_resched() in trace_event_eval_update() allows
other tasks to be executed and thus continue working asynchronously
like before without blocking any pending task at boot time.
Link: https://lore.kernel.org/linux-trace-kernel/20230929191637.416931-1-cleger@rivosinc.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Clément Léger <cleger@rivosinc.com>
Tested-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
It was discovered that the ring buffer polling was incorrectly stating
that read would not block, but that's because polling did not take into
account that reads will block if the "buffer-percent" was set. Instead,
the ring buffer polling would say reads would not block if there was any
data in the ring buffer. This was incorrect behavior from a user space
point of view. This was fixed by commit 42fb0a1e84ff by having the polling
code check if the ring buffer had more data than what the user specified
"buffer percent" had.
The problem now is that the polling code did not register itself to the
writer that it wanted to wait for a specific "full" value of the ring
buffer. The result was that the writer would wake the polling waiter
whenever there was a new event. The polling waiter would then wake up, see
that there's not enough data in the ring buffer to notify user space and
then go back to sleep. The next event would wake it up again.
Before the polling fix was added, the code would wake up around 100 times
for a hackbench 30 benchmark. After the "fix", due to the constant waking
of the writer, it would wake up over 11,0000 times! It would never leave
the kernel, so the user space behavior was still "correct", but this
definitely is not the desired effect.
To fix this, have the polling code add what it's waiting for to the
"shortest_full" variable, to tell the writer not to wake it up if the
buffer is not as full as it expects to be.
Note, after this fix, it appears that the waiter is now woken up around 2x
the times it was before (~200). This is a tremendous improvement from the
11,000 times, but I will need to spend some time to see why polling is
more aggressive in its wakeups than the read blocking code.
Link: https://lore.kernel.org/linux-trace-kernel/20230929180113.01c2cae3@rorschach.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Fixes: 42fb0a1e84ff ("tracing/ring-buffer: Have polling block on watermark")
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Tested-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:
- fix the narea calculation in swiotlb initialization (Ross Lagerwall)
- fix the check whether a device has used swiotlb (Petr Tesarik)
* tag 'dma-mapping-6.6-2023-09-30' of git://git.infradead.org/users/hch/dma-mapping:
swiotlb: fix the check whether a device has used software IO TLB
swiotlb: use the calculated number of areas
|
|
Pull iomap fixes from Darrick Wong:
- Handle a race between writing and shrinking block devices by
returning EIO
- Fix a typo in a comment
* tag 'iomap-6.6-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: Spelling s/preceeding/preceding/g
iomap: add a workaround for racy i_size updates on block devices
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
ice: add PTP auxiliary bus support
Michal Michalik says:
Auxiliary bus allows exchanging information between PFs, which allows
both fixing problems and simplifying new features implementation.
The auxiliary bus is enabled for all devices supported by ice driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"Usual business: a driver fix, a DT fix, a minor core fix"
* tag 'i2c-for-6.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: npcm7xx: Fix callback completion ordering
i2c: mux: Avoid potential false error message in i2c_mux_add_adapter
dt-bindings: i2c: mxs: Pass ref and 'unevaluatedProperties: false'
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fix from Rafael Wysocki:
"Fix a possible NULL pointer dereference in the error path of
acpi_video_bus_add() resulting from recent changes (Dinghao Liu)"
* tag 'acpi-6.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: video: Fix NULL pointer dereference in acpi_video_bus_add()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix arch_stack_walk_reliable(), used by live patching
- Fix powerpc selftests to work with run_kselftest.sh
Thanks to Joe Lawrence and Petr Mladek.
* tag 'powerpc-6.6-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
selftests/powerpc: Fix emit_tests to work with run_kselftest.sh
powerpc/stacktrace: Fix arch_stack_walk_reliable()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fix from Chuck Lever:
- Fix NFSv4 READ corner case
* tag 'nfsd-6.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
NFSD: Fix zero NFSv4 READ results when RQ_SPLICE_OK is not set
|
|
Commit d52b59315bf5 ("bpf: Adjust size_index according to the value of
KMALLOC_MIN_SIZE") uses KMALLOC_MIN_SIZE to adjust size_index, but as
reported by Nathan, the adjustment is not enough, because
__kmalloc_minalign() also decides the minimal alignment of slab object
as shown in new_kmalloc_cache() and its value may be greater than
KMALLOC_MIN_SIZE (e.g., 64 bytes vs 8 bytes under a riscv QEMU VM).
Instead of invoking __kmalloc_minalign() in bpf subsystem to find the
maximal alignment, just using kmalloc_size_roundup() directly to get the
corresponding slab object size for each allocation size. If these two
sizes are unmatched, adjust size_index to select a bpf_mem_cache with
unit_size equal to the object_size of the underlying slab cache for the
allocation size.
Fixes: 822fb26bdb55 ("bpf: Add a hint to allocated objects.")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Closes: https://lore.kernel.org/bpf/20230914181407.GA1000274@dev-arch.thelio-3990X/
Signed-off-by: Hou Tao <houtao1@huawei.com>
Tested-by: Emil Renner Berthing <emil.renner.berthing@canonical.com>
Link: https://lore.kernel.org/r/20230928101558.2594068-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Pull smb client fix from Steve French:
"Fix for password freeing potential oops (also for stable)"
* tag '6.6-rc3-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6:
fs/smb/client: Reset password pointer to NULL
|
|
Eric reported that handling corresponding crash hotplug event can be
failed easily when many memory hotplug event are notified in a short
period. They failed because failing to take __kexec_lock.
=======
[ 78.714569] Fallback order for Node 0: 0
[ 78.714575] Built 1 zonelists, mobility grouping on. Total pages: 1817886
[ 78.717133] Policy zone: Normal
[ 78.724423] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
[ 78.727207] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
[ 80.056643] PEFILE: Unsigned PE binary
=======
The memory hotplug events are notified very quickly and very many, while
the handling of crash hotplug is much slower relatively. So the atomic
variable __kexec_lock and kexec_trylock() can't guarantee the
serialization of crash hotplug handling.
Here, add a new mutex lock __crash_hotplug_lock to serialize crash hotplug
handling specifically. This doesn't impact the usage of __kexec_lock.
Link: https://lkml.kernel.org/r/20230926120905.392903-1-bhe@redhat.com
Fixes: 247262756121 ("crash: add generic infrastructure for crash hotplug support")
Signed-off-by: Baoquan He <bhe@redhat.com>
Tested-by: Eric DeVolder <eric.devolder@oracle.com>
Reviewed-by: Eric DeVolder <eric.devolder@oracle.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
hugetlb_reparenting_test.sh that may cause error
According to the awk manual, the -e option does not need to be specified
in front of 'program' (unless you need to mix program-file).
The redundant -e option can cause error when users use awk tools other
than gawk (for example, mawk does not support the -e option).
Error Example:
awk: not an option: -e
Link: https://lkml.kernel.org/r/VI1P193MB075228810591AF2FDD7D42C599C3A@VI1P193MB0752.EURP193.PROD.OUTLOOK.COM
Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
specified
When calling mbind() with MPOL_MF_{MOVE|MOVEALL} | MPOL_MF_STRICT, kernel
should attempt to migrate all existing pages, and return -EIO if there is
misplaced or unmovable page. Then commit 6f4576e3687b ("mempolicy: apply
page table walker on queue_pages_range()") messed up the return value and
didn't break VMA scan early ianymore when MPOL_MF_STRICT alone. The
return value problem was fixed by commit a7f40cfe3b7a ("mm: mempolicy:
make mbind() return -EIO when MPOL_MF_STRICT is specified"), but it broke
the VMA walk early if unmovable page is met, it may cause some pages are
not migrated as expected.
The code should conceptually do:
if (MPOL_MF_MOVE|MOVEALL)
scan all vmas
try to migrate the existing pages
return success
else if (MPOL_MF_MOVE* | MPOL_MF_STRICT)
scan all vmas
try to migrate the existing pages
return -EIO if unmovable or migration failed
else /* MPOL_MF_STRICT alone */
break early if meets unmovable and don't call mbind_range() at all
else /* none of those flags */
check the ranges in test_walk, EFAULT without mbind_range() if discontig.
Fixed the behavior.
Link: https://lkml.kernel.org/r/20230920223242.3425775-1-yang@os.amperecomputing.com
Fixes: a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified")
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org> [4.9+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When CONFIG_DAMON_VADDR_KUNIT_TEST=y and making CONFIG_DEBUG_KMEMLEAK=y
and CONFIG_DEBUG_KMEMLEAK_AUTO_SCAN=y, the below memory leak is detected.
Since commit 9f86d624292c ("mm/damon/vaddr-test: remove unnecessary
variables"), the damon_destroy_ctx() is removed, but still call
damon_new_target() and damon_new_region(), the damon_region which is
allocated by kmem_cache_alloc() in damon_new_region() and the damon_target
which is allocated by kmalloc in damon_new_target() are not freed. And
the damon_region which is allocated in damon_new_region() in
damon_set_regions() is also not freed.
So use damon_destroy_target to free all the damon_regions and damon_target.
unreferenced object 0xffff888107c9a940 (size 64):
comm "kunit_try_catch", pid 1069, jiffies 4294670592 (age 732.761s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 06 00 00 00 6b 6b 6b 6b ............kkkk
60 c7 9c 07 81 88 ff ff f8 cb 9c 07 81 88 ff ff `...............
backtrace:
[<ffffffff817e0167>] kmalloc_trace+0x27/0xa0
[<ffffffff819c11cf>] damon_new_target+0x3f/0x1b0
[<ffffffff819c7d55>] damon_do_test_apply_three_regions.constprop.0+0x95/0x3e0
[<ffffffff819c82be>] damon_test_apply_three_regions1+0x21e/0x260
[<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
[<ffffffff81237cf6>] kthread+0x2b6/0x380
[<ffffffff81097add>] ret_from_fork+0x2d/0x70
[<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
unreferenced object 0xffff8881079cc740 (size 56):
comm "kunit_try_catch", pid 1069, jiffies 4294670592 (age 732.761s)
hex dump (first 32 bytes):
05 00 00 00 00 00 00 00 14 00 00 00 00 00 00 00 ................
6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 6b 6b 6b 6b kkkkkkkk....kkkk
backtrace:
[<ffffffff819bc492>] damon_new_region+0x22/0x1c0
[<ffffffff819c7d91>] damon_do_test_apply_three_regions.constprop.0+0xd1/0x3e0
[<ffffffff819c82be>] damon_test_apply_three_regions1+0x21e/0x260
[<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
[<ffffffff81237cf6>] kthread+0x2b6/0x380
[<ffffffff81097add>] ret_from_fork+0x2d/0x70
[<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
unreferenced object 0xffff888107c9ac40 (size 64):
comm "kunit_try_catch", pid 1071, jiffies 4294670595 (age 732.843s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 06 00 00 00 6b 6b 6b 6b ............kkkk
a0 cc 9c 07 81 88 ff ff 78 a1 76 07 81 88 ff ff ........x.v.....
backtrace:
[<ffffffff817e0167>] kmalloc_trace+0x27/0xa0
[<ffffffff819c11cf>] damon_new_target+0x3f/0x1b0
[<ffffffff819c7d55>] damon_do_test_apply_three_regions.constprop.0+0x95/0x3e0
[<ffffffff819c851e>] damon_test_apply_three_regions2+0x21e/0x260
[<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
[<ffffffff81237cf6>] kthread+0x2b6/0x380
[<ffffffff81097add>] ret_from_fork+0x2d/0x70
[<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
unreferenced object 0xffff8881079ccc80 (size 56):
comm "kunit_try_catch", pid 1071, jiffies 4294670595 (age 732.843s)
hex dump (first 32 bytes):
05 00 00 00 00 00 00 00 14 00 00 00 00 00 00 00 ................
6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 6b 6b 6b 6b kkkkkkkk....kkkk
backtrace:
[<ffffffff819bc492>] damon_new_region+0x22/0x1c0
[<ffffffff819c7d91>] damon_do_test_apply_three_regions.constprop.0+0xd1/0x3e0
[<ffffffff819c851e>] damon_test_apply_three_regions2+0x21e/0x260
[<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
[<ffffffff81237cf6>] kthread+0x2b6/0x380
[<ffffffff81097add>] ret_from_fork+0x2d/0x70
[<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
unreferenced object 0xffff888107c9af40 (size 64):
comm "kunit_try_catch", pid 1073, jiffies 4294670597 (age 733.011s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 06 00 00 00 6b 6b 6b 6b ............kkkk
20 a2 76 07 81 88 ff ff b8 a6 76 07 81 88 ff ff .v.......v.....
backtrace:
[<ffffffff817e0167>] kmalloc_trace+0x27/0xa0
[<ffffffff819c11cf>] damon_new_target+0x3f/0x1b0
[<ffffffff819c7d55>] damon_do_test_apply_three_regions.constprop.0+0x95/0x3e0
[<ffffffff819c877e>] damon_test_apply_three_regions3+0x21e/0x260
[<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
[<ffffffff81237cf6>] kthread+0x2b6/0x380
[<ffffffff81097add>] ret_from_fork+0x2d/0x70
[<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
unreferenced object 0xffff88810776a200 (size 56):
comm "kunit_try_catch", pid 1073, jiffies 4294670597 (age 733.011s)
hex dump (first 32 bytes):
05 00 00 00 00 00 00 00 14 00 00 00 00 00 00 00 ................
6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 6b 6b 6b 6b kkkkkkkk....kkkk
backtrace:
[<ffffffff819bc492>] damon_new_region+0x22/0x1c0
[<ffffffff819c7d91>] damon_do_test_apply_three_regions.constprop.0+0xd1/0x3e0
[<ffffffff819c877e>] damon_test_apply_three_regions3+0x21e/0x260
[<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
[<ffffffff81237cf6>] kthread+0x2b6/0x380
[<ffffffff81097add>] ret_from_fork+0x2d/0x70
[<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
unreferenced object 0xffff88810776a740 (size 56):
comm "kunit_try_catch", pid 1073, jiffies 4294670597 (age 733.025s)
hex dump (first 32 bytes):
3d 00 00 00 00 00 00 00 3f 00 00 00 00 00 00 00 =.......?.......
6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 6b 6b 6b 6b kkkkkkkk....kkkk
backtrace:
[<ffffffff819bc492>] damon_new_region+0x22/0x1c0
[<ffffffff819bfcc2>] damon_set_regions+0x4c2/0x8e0
[<ffffffff819c7dbb>] damon_do_test_apply_three_regions.constprop.0+0xfb/0x3e0
[<ffffffff819c877e>] damon_test_apply_three_regions3+0x21e/0x260
[<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
[<ffffffff81237cf6>] kthread+0x2b6/0x380
[<ffffffff81097add>] ret_from_fork+0x2d/0x70
[<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
unreferenced object 0xffff888108038240 (size 64):
comm "kunit_try_catch", pid 1075, jiffies 4294670600 (age 733.022s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 03 00 00 00 6b 6b 6b 6b ............kkkk
48 ad 76 07 81 88 ff ff 98 ae 76 07 81 88 ff ff H.v.......v.....
backtrace:
[<ffffffff817e0167>] kmalloc_trace+0x27/0xa0
[<ffffffff819c11cf>] damon_new_target+0x3f/0x1b0
[<ffffffff819c7d55>] damon_do_test_apply_three_regions.constprop.0+0x95/0x3e0
[<ffffffff819c898d>] damon_test_apply_three_regions4+0x1cd/0x210
[<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
[<ffffffff81237cf6>] kthread+0x2b6/0x380
[<ffffffff81097add>] ret_from_fork+0x2d/0x70
[<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
unreferenced object 0xffff88810776ad28 (size 56):
comm "kunit_try_catch", pid 1075, jiffies 4294670600 (age 733.022s)
hex dump (first 32 bytes):
05 00 00 00 00 00 00 00 07 00 00 00 00 00 00 00 ................
6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 6b 6b 6b 6b kkkkkkkk....kkkk
backtrace:
[<ffffffff819bc492>] damon_new_region+0x22/0x1c0
[<ffffffff819bfcc2>] damon_set_regions+0x4c2/0x8e0
[<ffffffff819c7dbb>] damon_do_test_apply_three_regions.constprop.0+0xfb/0x3e0
[<ffffffff819c898d>] damon_test_apply_three_regions4+0x1cd/0x210
[<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
[<ffffffff81237cf6>] kthread+0x2b6/0x380
[<ffffffff81097add>] ret_from_fork+0x2d/0x70
[<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
Link: https://lkml.kernel.org/r/20230925072100.3725620-1-ruanjinjie@huawei.com
Fixes: 9f86d624292c ("mm/damon/vaddr-test: remove unnecessary variables")
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This reverts commits 86327e8eb94c ("memcg: drop kmem.limit_in_bytes") and
partially reverts 58056f77502f ("memcg, kmem: further deprecate
kmem.limit_in_bytes") which have incrementally removed support for the
kernel memory accounting hard limit. Unfortunately it has turned out that
there is still userspace depending on the existence of
memory.kmem.limit_in_bytes [1]. The underlying functionality is not
really required but the non-existent file just confuses the userspace
which fails in the result. The patch to fix this on the userspace side
has been submitted but it is hard to predict how it will propagate through
the maze of 3rd party consumers of the software.
Now, reverting alone 86327e8eb94c is not an option because there is
another set of userspace which cannot cope with ENOTSUPP returned when
writing to the file. Therefore we have to go and revisit 58056f77502f as
well. There are two ways to go ahead. Either we give up on the
deprecation and fully revert 58056f77502f as well or we can keep
kmem.limit_in_bytes but make the write a noop and warn about the fact.
This should work for both known breaking workloads which depend on the
existence but do not depend on the hard limit enforcement.
Note to backporters to stable trees. a8c49af3be5f ("memcg: add per-memcg
total kernel memory stat") introduced in 4.18 has added memcg_account_kmem
so the accounting is not done by obj_cgroup_charge_pages directly for v1
anymore. Prior kernels need to add it explicitly (thanks to Johannes for
pointing this out).
[akpm@linux-foundation.org: fix build - remove unused local]
Link: http://lkml.kernel.org/r/20230920081101.GA12096@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net [1]
Link: https://lkml.kernel.org/r/ZRE5VJozPZt9bRPy@dhcp22.suse.cz
Fixes: 86327e8eb94c ("memcg: drop kmem.limit_in_bytes")
Fixes: 58056f77502f ("memcg, kmem: further deprecate kmem.limit_in_bytes")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
While stress-testing zswap a memory corruption was happening when writing
back pages. __frontswap_store used to check for duplicate entries before
attempting to store a page in zswap, this was because if the store fails
the old entry isn't removed from the tree. This change removes duplicate
entries in zswap_store before the actual attempt.
[cerasuolodomenico@gmail.com: add a warning and a comment, per Johannes]
Link: https://lkml.kernel.org/r/20230925130002.1929369-1-cerasuolodomenico@gmail.com
Link: https://lkml.kernel.org/r/20230922172211.1704917-1-cerasuolodomenico@gmail.com
Fixes: 42c06a0e8ebe ("mm: kill frontswap")
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When called with a swap entry that does not embed a PFN (e.g.
PTE_MARKER_POISONED or PTE_MARKER_UFFD_WP), the previous implementation of
set_huge_pte_at() would either cause a BUG() to fire (if CONFIG_DEBUG_VM
is enabled) or cause a dereference of an invalid address and subsequent
panic.
arm64's huge pte implementation supports multiple huge page sizes, some of
which are implemented in the page table with multiple contiguous entries.
So set_huge_pte_at() needs to work out how big the logical pte is, so that
it can also work out how many physical ptes (or pmds) need to be written.
It previously did this by grabbing the folio out of the pte and querying
its size.
However, there are cases when the pte being set is actually a swap entry.
But this also used to work fine, because for huge ptes, we only ever saw
migration entries and hwpoison entries. And both of these types of swap
entries have a PFN embedded, so the code would grab that and everything
still worked out.
But over time, more calls to set_huge_pte_at() have been added that set
swap entry types that do not embed a PFN. And this causes the code to go
bang. The triggering case is for the uffd poison test, commit
99aa77215ad0 ("selftests/mm: add uffd unit test for UFFDIO_POISON"), which
causes a PTE_MARKER_POISONED swap entry to be set, coutesey of commit
8a13897fb0da ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs") -
added in v6.5-rc7. Although review shows that there are other call sites
that set PTE_MARKER_UFFD_WP (which also has no PFN), these don't trigger
on arm64 because arm64 doesn't support UFFD WP.
Arguably, the root cause is really due to commit 18f3962953e4 ("mm:
hugetlb: kill set_huge_swap_pte_at()"), which aimed to simplify the
interface to the core code by removing set_huge_swap_pte_at() (which took
a page size parameter) and replacing it with calls to set_huge_pte_at()
where the size was inferred from the folio, as descibed above. While that
commit didn't break anything at the time, it did break the interface
because it couldn't handle swap entries without PFNs. And since then new
callers have come along which rely on this working. But given the
brokeness is only observable after commit 8a13897fb0da ("mm: userfaultfd:
support UFFDIO_POISON for hugetlbfs"), that one gets the Fixes tag.
Now that we have modified the set_huge_pte_at() interface to pass the huge
page size in the previous patch, we can trivially fix this issue.
Link: https://lkml.kernel.org/r/20230922115804.2043771-3-ryan.roberts@arm.com
Fixes: 8a13897fb0da ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: <stable@vger.kernel.org> [6.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Fix set_huge_pte_at() panic on arm64", v2.
This series fixes a bug in arm64's implementation of set_huge_pte_at(),
which can result in an unprivileged user causing a kernel panic. The
problem was triggered when running the new uffd poison mm selftest for
HUGETLB memory. This test (and the uffd poison feature) was merged for
v6.5-rc7.
Ideally, I'd like to get this fix in for v6.6 and I've cc'ed stable
(correctly this time) to get it backported to v6.5, where the issue first
showed up.
Description of Bug
==================
arm64's huge pte implementation supports multiple huge page sizes, some of
which are implemented in the page table with multiple contiguous entries.
So set_huge_pte_at() needs to work out how big the logical pte is, so that
it can also work out how many physical ptes (or pmds) need to be written.
It previously did this by grabbing the folio out of the pte and querying
its size.
However, there are cases when the pte being set is actually a swap entry.
But this also used to work fine, because for huge ptes, we only ever saw
migration entries and hwpoison entries. And both of these types of swap
entries have a PFN embedded, so the code would grab that and everything
still worked out.
But over time, more calls to set_huge_pte_at() have been added that set
swap entry types that do not embed a PFN. And this causes the code to go
bang. The triggering case is for the uffd poison test, commit
99aa77215ad0 ("selftests/mm: add uffd unit test for UFFDIO_POISON"), which
causes a PTE_MARKER_POISONED swap entry to be set, coutesey of commit
8a13897fb0da ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs") -
added in v6.5-rc7. Although review shows that there are other call sites
that set PTE_MARKER_UFFD_WP (which also has no PFN), these don't trigger
on arm64 because arm64 doesn't support UFFD WP.
If CONFIG_DEBUG_VM is enabled, we do at least get a BUG(), but otherwise,
it will dereference a bad pointer in page_folio():
static inline struct folio *hugetlb_swap_entry_to_folio(swp_entry_t entry)
{
VM_BUG_ON(!is_migration_entry(entry) && !is_hwpoison_entry(entry));
return page_folio(pfn_to_page(swp_offset_pfn(entry)));
}
Fix
===
The simplest fix would have been to revert the dodgy cleanup commit
18f3962953e4 ("mm: hugetlb: kill set_huge_swap_pte_at()"), but since
things have moved on, this would have required an audit of all the new
set_huge_pte_at() call sites to see if they should be converted to
set_huge_swap_pte_at(). As per the original intent of the change, it
would also leave us open to future bugs when people invariably get it
wrong and call the wrong helper.
So instead, I've added a huge page size parameter to set_huge_pte_at().
This means that the arm64 code has the size in all cases. It's a bigger
change, due to needing to touch the arches that implement the function,
but it is entirely mechanical, so in my view, low risk.
I've compile-tested all touched arches; arm64, parisc, powerpc, riscv,
s390, sparc (and additionally x86_64). I've additionally booted and run
mm selftests against arm64, where I observe the uffd poison test is fixed,
and there are no other regressions.
This patch (of 2):
In order to fix a bug, arm64 needs to be told the size of the huge page
for which the pte is being set in set_huge_pte_at(). Provide for this by
adding an `unsigned long sz` parameter to the function. This follows the
same pattern as huge_pte_clear().
This commit makes the required interface modifications to the core mm as
well as all arches that implement this function (arm64, parisc, powerpc,
riscv, s390, sparc). The actual arm64 bug will be fixed in a separate
commit.
No behavioral changes intended.
Link: https://lkml.kernel.org/r/20230922115804.2043771-1-ryan.roberts@arm.com
Link: https://lkml.kernel.org/r/20230922115804.2043771-2-ryan.roberts@arm.com
Fixes: 8a13897fb0da ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> [powerpc 8xx]
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> [vmalloc change]
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: <stable@vger.kernel.org> [6.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When updating the maple tree iterator to avoid rewalks, an issue was
introduced when shifting beyond the limits. This can be seen by trying to
go to the previous address of 0, which would set the maple node to
MAS_NONE and keep the range as the last entry.
Subsequent calls to mas_find() would then search upwards from mas->last
and skip the value at mas->index/mas->last. This showed up as a bug in
mprotect which skips the actual VMA at the current range after attempting
to go to the previous VMA from 0.
Since MAS_NONE may already be set when searching for a value that isn't
contained within a node, changing the handling of MAS_NONE in mas_find()
would make the code more complicated and error prone. Furthermore, there
was no way to tell which limit was hit, and thus which action to take
(next or the entry at the current range).
This solution is to add two states to track what happened with the
previous iterator action. This allows for the expected behaviour of the
next command to return the correct item (either the item at the range
requested, or the next/previous).
Tests are also added and updated accordingly.
Link: https://lkml.kernel.org/r/20230921181236.509072-3-Liam.Howlett@oracle.com
Link: https://gist.github.com/heatd/85d2971fae1501b55b6ea401fbbe485b
Link: https://lore.kernel.org/linux-mm/20230921181236.509072-1-Liam.Howlett@oracle.com/
Fixes: 39193685d585 ("maple_tree: try harder to keep active node with mas_prev()")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reported-by: Pedro Falcato <pedro.falcato@gmail.com>
Closes: https://gist.github.com/heatd/85d2971fae1501b55b6ea401fbbe485b
Closes: https://bugs.archlinux.org/task/79656
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "maple_tree: Fix mas_prev() state regression".
Pedro Falcato retported an mprotect regression [1] which was bisected back
to the iterator changes for maple tree. Root cause analysis showed the
mas_prev() running off the end of the VMA space (previous from 0) followed
by mas_find(), would skip the first value.
This patchset introduces maple state underflow/overflow so the sequence of
calls on the maple state will return what the user expects.
Users who encounter this bug may see mprotect(), userfaultfd_register(),
and mlock() fail on VMAs mapped with address 0.
This patch (of 2):
Instead of constantly checking each possibility of the maple state,
create a fast path that will skip over checking unlikely states.
Link: https://lkml.kernel.org/r/20230921181236.509072-1-Liam.Howlett@oracle.com
Link: https://lkml.kernel.org/r/20230921181236.509072-2-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Pedro Falcato <pedro.falcato@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|