summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-02-12perf trace: Add --summary-mode optionNamhyung Kim
The --summary-mode option will select how to show the syscall summary at the end. By default, it'll show the summary for each thread and it's the same as if --summary-mode=thread is passed. The other option is to show total summary, which is --summary-mode=total. I'd like to have this instead of a separate option like --total-summary because we may want to add a new summary mode (by cgroup) later. $ sudo ./perf trace -as --summary-mode=total sleep 1 Summary of events: total, 21580 events syscall calls errors total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- ------ -------- --------- --------- --------- ------ epoll_wait 1305 0 14716.712 0.000 11.277 551.529 8.87% futex 1256 89 13331.197 0.000 10.614 733.722 15.49% poll 669 0 6806.618 0.000 10.174 459.316 11.77% ppoll 220 0 3968.797 0.000 18.040 516.775 25.35% clock_nanosleep 1 0 1000.027 1000.027 1000.027 1000.027 0.00% epoll_pwait 21 0 592.783 0.000 28.228 522.293 88.29% nanosleep 16 0 60.515 0.000 3.782 10.123 33.33% ioctl 510 0 4.284 0.001 0.008 0.182 8.84% recvmsg 1434 775 3.497 0.001 0.002 0.174 6.37% write 1393 0 2.854 0.001 0.002 0.017 1.79% read 1063 100 2.236 0.000 0.002 0.083 5.11% ... Reviewed-by: Howard Chu <howardchu95@gmail.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/r/20250205205443.1986408-5-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-12perf tools: Get rid of now-unused rb_resort.hNamhyung Kim
It was only used in perf trace and it switched to use hashmap instead. Let's delete the code. Reviewed-by: Howard Chu <howardchu95@gmail.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/r/20250205205443.1986408-4-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-12perf trace: Convert syscall_stats to hashmapNamhyung Kim
It was using a RBtree-based int-list as a hash and a custom resort logic for that. As we have hashmap, let's convert to it and add a custom sort function for the hashmap entries using an array. It should be faster and more light-weighted. It's also to prepare supporting system-wide syscall stats. No functional changes intended. Reviewed-by: Howard Chu <howardchu95@gmail.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/r/20250205205443.1986408-3-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-12perf trace: Allocate syscall stats only if summary is onNamhyung Kim
The syscall stats are used only when summary is requested. Let's avoid unnecessary operations. While at it, let's pass 'trace' pointer directly instead of passing 'output' file pointer and 'summary' option in the 'trace' separately. Reviewed-by: Howard Chu <howardchu95@gmail.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/r/20250205205443.1986408-2-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-12perf tests: Fix Tool PMU test segfaultJames Clark
tool_pmu__event_to_str() now handles skipped events by returning NULL, so it's wrong to re-check for a skip on the resulting string. Calling tool_pmu__skip_event() with a NULL string results in a segfault so remove the unnecessary skip to fix it: $ perf test -vv "parsing with PMU name" 12.2: Parsing with PMU name: ... ---- unexpected signal (11) ---- 12.2: Parsing with PMU name : FAILED! Fixes: ee8aef2d2321 ("perf tools: Add skip check in tool_pmu__event_to_str()") Signed-off-by: James Clark <james.clark@linaro.org> Reported-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com> Acked-by: Kan Liang <kan.liang@linux.intel.com> Tested-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250212163859.1489916-1-james.clark@linaro.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-10perf tools: Add skip check in tool_pmu__event_to_str()Kan Liang
Some topdown related metrics may fail on hybrid machines. $ perf stat -M tma_frontend_bound Cannot resolve IDs for tma_frontend_bound: cpu_atom@TOPDOWN_FE_BOUND.ALL@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@) In the find_tool_events(), the tool_pmu__event_to_str() is used to compare the tool_events. It only checks the event name, no PMU or arch. So the tool_events[TOOL_PMU__EVENT_SLOTS] is set to true, because the p-core Topdown metrics has "slots" event. The tool_events is shared. So when parsing the e-core metrics, the "slots" is automatically added. The "slots" event as a tool event should only be available on arm64. It has a different meaning on X86. The tool_pmu__skip_event() intends handle the case. Apply it for tool_pmu__event_to_str() as well. There is a lack of sanity check in the expr__get_id(). Add the check. Closes: https://lore.kernel.org/lkml/608077bc-4139-4a97-8dc4-7997177d95c4@linux.intel.com/ Fixes: 069057239a67 ("perf tool_pmu: Move expr literals to tool_pmu") Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Reviewed-by: Ian Rogers <irogers@google.com> Cc: thomas.falcon@intel.com Link: https://lore.kernel.org/r/20250207152844.302167-1-kan.liang@linux.intel.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-10perf tools: Deadcode removalDr. David Alan Gilbert
The last use of machine__fprintf_vmlinux_path() was removed in 2011 by commit ab81f3fd350c ("perf top: Reuse the 'report' hist_entry/hists classes") mmap_cpu_mask__duplicate() was added in 2021 by commit 6bd006c6eb7f ("perf mmap: Introduce mmap_cpu_mask__duplicate()") but hasn't been used since. Remove them. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Tested-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250204220545.456435-1-linux@treblig.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-05Merge tag 'v6.14-rc1' into perf-tools-nextNamhyung Kim
To get the various fixes in the current master. Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-04perf stat: Changes to event name uniquificationIan Rogers
The existing logic would disable uniquification on an evlist or enable it per evsel, this is unfortunate as uniquification is most needed when events have the same name and so the whole evlist must be considered. Change the initial disable uniquify on an evlist processing to also set a needs_uniquify flag, for cases like the matching event names. This must be done as an initial pass as uniquification of an event name will change the behavior of the check. Keep the per counter uniquification but now only uniquify event names when the needs_uniquify flag is set. Before this change a hwmon like temp1 wouldn't be uniquified and afterwards it will (ie the PMU is added to the temp1 event's name). Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20250201074320.746259-6-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-04perf stat: Don't merge counters purely on nameIan Rogers
Counter merging was added in commit 942c5593393d ("perf stat: Add perf_stat_merge_counters()"), however, it merges events with the same name on different PMUs regardless of whether the different PMUs are actually of the same type (ie they differ only in the suffix on the PMU). For hwmon events there may be a temp1 event on every PMU, but the PMU names are all unique and don't differ just by a suffix. The merging is over eager and will merge all the hwmon counters together meaning an aggregated and very large temp1 value is shown. The same would be true for say cache events and memory controller events where the PMUs differ but the event names are the same. Fix the problem by correctly saying two PMUs alias when they differ only by suffix. Note, there is an overlap with evsel's merged_stat with aggregation and the evsel's metric_leader where aggregation happens for metrics. Fixes: 942c5593393d ("perf stat: Add perf_stat_merge_counters()") Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20250201074320.746259-5-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-04perf pmu: Rename name matching for no suffix or wildcard variantsIan Rogers
Wildcard PMU naming will match a name like pmu_1 to a PMU name like pmu_10 but not to a PMU name like pmu_2 as the suffix forms part of the match. No suffix matching will match pmu_10 to either pmu_1 or pmu_2. Add or rename matching functions on PMU to make it clearer what kind of matching is being performed. Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20250201074320.746259-4-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-04perf pmus: Restructure pmu_read_sysfs to scan fewer PMUsIan Rogers
Rather than scanning core or all PMUs, allow pmu_read_sysfs to read some combination of core, other, hwmon and tool PMUs. The PMUs that should be read and are already read are held as bitmaps. It is known that a "hwmon_" prefix is necessary for a hwmon PMU's name, similarly with "tool", so only scan those PMUs in situations the PMU name or the PMU's type number make sense to. The number of openat system calls reduces from 276 to 98 for a hwmon event. The number of openats for regular perf events isn't changed. Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20250201074320.746259-3-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-04perf evsel: Reduce scanning core PMUs in is_hybridIan Rogers
evsel__is_hybrid returns true if there are multiple core PMUs and the evsel is for a core PMU. Determining the number of core PMUs can require loading/scanning PMUs. There's no point doing the scanning if evsel for the is_hybrid test isn't core so reorder the tests to reduce PMU scanning. Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20250201074320.746259-2-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-04perf test: Fix Hwmon PMU test endianess issueThomas Richter
perf test 11 hwmon fails on s390 with this error # ./perf test -Fv 11 --- start --- ---- end ---- 11.1: Basic parsing test : Ok --- start --- Testing 'temp_test_hwmon_event1' Using CPUID IBM,3931,704,A01,3.7,002f temp_test_hwmon_event1 -> hwmon_a_test_hwmon_pmu/temp_test_hwmon_event1/ FAILED tests/hwmon_pmu.c:189 Unexpected config for 'temp_test_hwmon_event1', 292470092988416 != 655361 ---- end ---- 11.2: Parsing without PMU name : FAILED! --- start --- Testing 'hwmon_a_test_hwmon_pmu/temp_test_hwmon_event1/' FAILED tests/hwmon_pmu.c:189 Unexpected config for 'hwmon_a_test_hwmon_pmu/temp_test_hwmon_event1/', 292470092988416 != 655361 ---- end ---- 11.3: Parsing with PMU name : FAILED! # The root cause is in member test_event::config which is initialized to 0xA0001 or 655361. During event parsing a long list event parsing functions are called and end up with this gdb call stack: #0 hwmon_pmu__config_term (hwm=0x168dfd0, attr=0x3ffffff5ee8, term=0x168db60, err=0x3ffffff81c8) at util/hwmon_pmu.c:623 #1 hwmon_pmu__config_terms (pmu=0x168dfd0, attr=0x3ffffff5ee8, terms=0x3ffffff5ea8, err=0x3ffffff81c8) at util/hwmon_pmu.c:662 #2 0x00000000012f870c in perf_pmu__config_terms (pmu=0x168dfd0, attr=0x3ffffff5ee8, terms=0x3ffffff5ea8, zero=false, apply_hardcoded=false, err=0x3ffffff81c8) at util/pmu.c:1519 #3 0x00000000012f88a4 in perf_pmu__config (pmu=0x168dfd0, attr=0x3ffffff5ee8, head_terms=0x3ffffff5ea8, apply_hardcoded=false, err=0x3ffffff81c8) at util/pmu.c:1545 #4 0x00000000012680c4 in parse_events_add_pmu (parse_state=0x3ffffff7fb8, list=0x168dc00, pmu=0x168dfd0, const_parsed_terms=0x3ffffff6090, auto_merge_stats=true, alternate_hw_config=10) at util/parse-events.c:1508 #5 0x00000000012684c6 in parse_events_multi_pmu_add (parse_state=0x3ffffff7fb8, event_name=0x168ec10 "temp_test_hwmon_event1", hw_config=10, const_parsed_terms=0x0, listp=0x3ffffff6230, loc_=0x3ffffff70e0) at util/parse-events.c:1592 #6 0x00000000012f0e4e in parse_events_parse (_parse_state=0x3ffffff7fb8, scanner=0x16878c0) at util/parse-events.y:293 #7 0x00000000012695a0 in parse_events__scanner (str=0x3ffffff81d8 "temp_test_hwmon_event1", input=0x0, parse_state=0x3ffffff7fb8) at util/parse-events.c:1867 #8 0x000000000126a1e8 in __parse_events (evlist=0x168b580, str=0x3ffffff81d8 "temp_test_hwmon_event1", pmu_filter=0x0, err=0x3ffffff81c8, fake_pmu=false, warn_if_reordered=true, fake_tp=false) at util/parse-events.c:2136 #9 0x00000000011e36aa in parse_events (evlist=0x168b580, str=0x3ffffff81d8 "temp_test_hwmon_event1", err=0x3ffffff81c8) at /root/linux/tools/perf/util/parse-events.h:41 #10 0x00000000011e3e64 in do_test (i=0, with_pmu=false, with_alias=false) at tests/hwmon_pmu.c:164 #11 0x00000000011e422c in test__hwmon_pmu (with_pmu=false) at tests/hwmon_pmu.c:219 #12 0x00000000011e431c in test__hwmon_pmu_without_pmu (test=0x1610368 <suite.hwmon_pmu>, subtest=1) at tests/hwmon_pmu.c:23 where the attr::config is set to value 292470092988416 or 0x10a0000000000 in line 625 of file ./util/hwmon_pmu.c: attr->config = key.type_and_num; However member key::type_and_num is defined as union and bit field: union hwmon_pmu_event_key { long type_and_num; struct { int num :16; enum hwmon_type type :8; }; }; s390 is big endian and Intel is little endian architecture. The events for the hwmon dummy pmu have num = 1 or num = 2 and type is set to HWMON_TYPE_TEMP (which is 10). On s390 this assignes member key::type_and_num the value of 0x10a0000000000 (which is 292470092988416) as shown in above trace output. Fix this and export the structure/union hwmon_pmu_event_key so the test shares the same implementation as the event parsing functions for union and bit fields. This should avoid endianess issues on all platforms. Output after: # ./perf test -F 11 11.1: Basic parsing test : Ok 11.2: Parsing without PMU name : Ok 11.3: Parsing with PMU name : Ok # Fixes: 531ee0fd4836 ("perf test: Add hwmon "PMU" test") Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Reviewed-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250131112400.568975-1-tmricht@linux.ibm.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-04perf test: Use cycles event in perf record test for leader_samplingThomas Richter
On s390 the event instructions can not be used for recording. This event is only supported by perf stat. Change the event from instructions to cycles in subtest test_leader_sampling. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Suggested-by: James Clark <james.clark@linaro.org> Reviewed-by: James Clark <james.clark@linaro.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20250131102756.4185235-3-tmricht@linux.ibm.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-04perf test: Fix perf record test for precise_maxThomas Richter
On s390 the event instructions can not be used for recording. This event is only supported by perf stat. Test that each event cycles and instructions supports sampling. If the event can not be sampled, skip it. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Suggested-by: James Clark <james.clark@linaro.org> Reviewed-by: James Clark <james.clark@linaro.org> Link: https://lore.kernel.org/r/20250131102756.4185235-2-tmricht@linux.ibm.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-03perf script: force stdin for flamegraph in live modeAnubhav Shelat
Currently, running "perf script flamegraph -a -F 99 sleep 1" should produce flamegraph.html containing the flamegraph. Howevever, it gives a segmentation fault. This is caused because the flamegraph.py script is supposed to take as input the output of "perf record", which should be in stdin. This would require passing "-i -" to flamegraph.py. However, the "flamegraph-report" script causes "perf script" command to take the "-i -" option instead of flamegraph.py, which causes no problem for "perf script", but causes a seg fault since flamegraph.py has no input file. To fix this I added the "-i -" option directly to the flamegraph-report script to ensure flamegraph.py gets input from stdin. Signed-off-by: Anubhav Shelat <ashelat@redhat.com> Tested-by: Michael Petlan <mpetlan@redhat.com> Link: https://lore.kernel.org/r/20250131145704.3164542-2-ashelat@redhat.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-03perf test: Extra verbosity and hypervisor skip for tpebs testIan Rogers
When not running as root and with higher perf event paranoia values the perf record forked by TPEBS can fail to attach to the process. Skip the test in these scenarios. Intel TPEBS test skips on non-Intel CPUs. On Intel CPUs under a hypervisor the cache-misses event may not be present or precise. Skip the test under this condition. Refactor the output code to be placed in a file so that on a signal the file can be dumped. This was necessary to catch the issue above as the failing perf record command would fail without output. Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Thomas Falcon <thomas.falcon@intel.com> Cc: Weilin Wang <weilin.wang@intel.com> Cc: James Clark <james.clark@linaro.org> Link: https://lore.kernel.org/r/20250130170135.5817-1-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-02Linux 6.14-rc1v6.14-rc1Linus Torvalds
2025-02-02Merge tag 'turbostat-2025.02.02' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux Pull turbostat updates from Len Brown: - Fix regression that affinitized forked child in one-shot mode. - Harden one-shot mode against hotplug online/offline - Enable RAPL SysWatt column by default - Add initial PTL, CWF platform support - Harden initial PMT code in response to early use - Enable first built-in PMT counter: CWF c1e residency - Refuse to run on unsupported platforms without --force, to encourage updating to a version that supports the system, and to avoid no-so-useful measurement results * tag 'turbostat-2025.02.02' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux: (25 commits) tools/power turbostat: version 2025.02.02 tools/power turbostat: Add CPU%c1e BIC for CWF tools/power turbostat: Harden one-shot mode against cpu offline tools/power turbostat: Fix forked child affinity regression tools/power turbostat: Add tcore clock PMT type tools/power turbostat: version 2025.01.14 tools/power turbostat: Allow adding PMT counters directly by sysfs path tools/power turbostat: Allow mapping multiple PMT files with the same GUID tools/power turbostat: Add PMT directory iterator helper tools/power turbostat: Extend PMT identification with a sequence number tools/power turbostat: Return default value for unmapped PMT domains tools/power turbostat: Check for non-zero value when MSR probing tools/power turbostat: Enhance turbostat self-performance visibility tools/power turbostat: Add fixed RAPL PSYS divisor for SPR tools/power turbostat: Fix PMT mmaped file size rounding tools/power turbostat: Remove SysWatt from DISABLED_BY_DEFAULT tools/power turbostat: Add an NMI column tools/power turbostat: add Busy% to "show idle" tools/power turbostat: Introduce --force parameter tools/power turbostat: Improve --help output ...
2025-02-02Merge tag 'sh-for-v6.14-tag1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/glaubitz/sh-linux Pull sh updates from John Paul Adrian Glaubitz: "Fixes and improvements for sh: - replace seq_printf() with the more efficient seq_put_decimal_ull_width() to increase performance when stress reading /proc/interrupts (David Wang) - migrate sh to the generic rule for built-in DTB to help avoid race conditions during parallel builds which can occur because Kbuild decends into arch/*/boot/dts twice (Masahiro Yamada) - replace select with imply in the board Kconfig for enabling hardware with complex dependencies. This addresses warnings which were reported by the kernel test robot (Geert Uytterhoeven)" * tag 'sh-for-v6.14-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/glaubitz/sh-linux: sh: boards: Use imply to enable hardware with complex dependencies sh: Migrate to the generic rule for built-in DTB sh: irq: Use seq_put_decimal_ull_width() for decimal values
2025-02-02tools/power turbostat: version 2025.02.02Len Brown
Summary of Changes since 2024.11.30: Fix regression in 2023.11.07 that affinitized forked child in one-shot mode. Harden one-shot mode against hotplug online/offline Enable RAPL SysWatt column by default. Add initial PTL, CWF platform support. Harden initial PMT code in response to early use. Enable first built-in PMT counter: CWF c1e residency Refuse to run on unsupported platforms without --force, to encourage updating to a version that supports the system, and to avoid no-so-useful measurement results. Signed-off-by: Len Brown <len.brown@intel.com>
2025-02-01Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull misc vfs cleanups from Al Viro: "Two unrelated patches - one is a removal of long-obsolete include in overlayfs (it used to need fs/internal.h, but the extern it wanted has been moved back to include/linux/namei.h) and another introduces convenience helper constructing struct qstr by a NUL-terminated string" * tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: add a string-to-qstr constructor fs/overlayfs/namei.c: get rid of include ../internal.h
2025-02-01Merge tag 'mips_6.14_1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux Pull MIPS fix from Thomas Bogendoerfer: "Revert commit breaking sysv ipc for o32 ABI" * tag 'mips_6.14_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: Revert "mips: fix shmctl/semctl/msgctl syscall for o32"
2025-02-01Merge tag 'v6.14-rc-smb3-client-fixes-part2' of ↵Linus Torvalds
git://git.samba.org/sfrench/cifs-2.6 Pull more smb client updates from Steve French: - various updates for special file handling: symlink handling, support for creating sockets, cleanups, new mount options (e.g. to allow disabling using reparse points for them, and to allow overriding the way symlinks are saved), and fixes to error paths - fix for kerberos mounts (allow IAKerb) - SMB1 fix for stat and for setting SACL (auditing) - fix an incorrect error code mapping - cleanups" * tag 'v6.14-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6: (21 commits) cifs: Fix parsing native symlinks directory/file type cifs: update internal version number cifs: Add support for creating WSL-style symlinks smb3: add support for IAKerb cifs: Fix struct FILE_ALL_INFO cifs: Add support for creating NFS-style symlinks cifs: Add support for creating native Windows sockets cifs: Add mount option -o reparse=none cifs: Add mount option -o symlink= for choosing symlink create type cifs: Fix creating and resolving absolute NT-style symlinks cifs: Simplify reparse point check in cifs_query_path_info() function cifs: Remove symlink member from cifs_open_info_data union cifs: Update description about ACL permissions cifs: Rename struct reparse_posix_data to reparse_nfs_data_buffer and move to common/smb2pdu.h cifs: Remove struct reparse_posix_data from struct cifs_open_info_data cifs: Remove unicode parameter from parse_reparse_point() function cifs: Fix getting and setting SACLs over SMB1 cifs: Remove intermediate object of failed create SFU call cifs: Validate EAs for WSL reparse points cifs: Change translation of STATUS_PRIVILEGE_NOT_HELD to -EPERM ...
2025-02-01Merge tag 'driver-core-6.14-rc1-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull debugfs fix from Greg KH: "Here is a single debugfs fix from Al to resolve a reported regression in the driver-core tree. It has been reported to fix the issue" * tag 'driver-core-6.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: debugfs: Fix the missing initializations in __debugfs_file_get()
2025-02-01Merge tag 'mm-hotfixes-stable-2025-02-01-03-56' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "21 hotfixes. 8 are cc:stable and the remainder address post-6.13 issues. 13 are for MM and 8 are for non-MM. All are singletons, please see the changelogs for details" * tag 'mm-hotfixes-stable-2025-02-01-03-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits) MAINTAINERS: include linux-mm for xarray maintenance revert "xarray: port tests to kunit" MAINTAINERS: add lib/test_xarray.c mailmap, MAINTAINERS, docs: update Carlos's email address mm/hugetlb: fix hugepage allocation for interleaved memory nodes mm: gup: fix infinite loop within __get_longterm_locked mm, swap: fix reclaim offset calculation error during allocation .mailmap: update email address for Christopher Obbard kfence: skip __GFP_THISNODE allocations on NUMA systems nilfs2: fix possible int overflows in nilfs_fiemap() mm: compaction: use the proper flag to determine watermarks kernel: be more careful about dup_mmap() failures and uprobe registering mm/fake-numa: handle cases with no SRAT info mm: kmemleak: fix upper boundary check for physical address objects mailmap: add an entry for Hamza Mahfooz MAINTAINERS: mailmap: update Yosry Ahmed's email address scripts/gdb: fix aarch64 userspace detection in get_current_task mm/vmscan: accumulate nr_demoted for accurate demotion statistics ocfs2: fix incorrect CPU endianness conversion causing mount failure mm/zsmalloc: add __maybe_unused attribute for is_first_zpdesc() ...
2025-02-01Merge tag 'media/v6.14-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media Pull media fix from Mauro Carvalho Chehab: "A revert for a regression in the uvcvideo driver" * tag 'media/v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: Revert "media: uvcvideo: Require entities to have a non-zero unique ID"
2025-02-01MAINTAINERS: include linux-mm for xarray maintenanceAndrew Morton
MM developers have an interest in the xarray code. Cc: David Gow <davidgow@google.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Tamir Duberstein <tamird@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01revert "xarray: port tests to kunit"Andrew Morton
Revert c7bb5cf9fc4e ("xarray: port tests to kunit"). It broke the build when compiing the xarray userspace test harness code. Reported-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Closes: https://lkml.kernel.org/r/07cf896e-adf8-414f-a629-a808fc26014a@oracle.com Cc: David Gow <davidgow@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Tamir Duberstein <tamird@gmail.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01MAINTAINERS: add lib/test_xarray.cTamir Duberstein
Ensure test-only changes are sent to the relevant maintainer. Link: https://lkml.kernel.org/r/20250129-xarray-test-maintainer-v1-1-482e31f30f47@gmail.com Signed-off-by: Tamir Duberstein <tamird@gmail.com> Cc: Mattew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01mailmap, MAINTAINERS, docs: update Carlos's email addressCarlos Bilbao
Update .mailmap to reflect my new (and final) primary email address, carlos.bilbao@kernel.org. Also update contact information in files Documentation/translations/sp_SP/index.rst and MAINTAINERS. Link: https://lkml.kernel.org/r/20250130012248.1196208-1-carlos.bilbao@kernel.org Signed-off-by: Carlos Bilbao <carlos.bilbao@kernel.org> Cc: Carlos Bilbao <bilbao@vt.edu> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mattew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01mm/hugetlb: fix hugepage allocation for interleaved memory nodesRitesh Harjani (IBM)
gather_bootmem_prealloc() assumes the start nid as 0 and size as num_node_state(N_MEMORY). That means in case if memory attached numa nodes are interleaved, then gather_bootmem_prealloc_parallel() will fail to scan few of these nodes. Since memory attached numa nodes can be interleaved in any fashion, hence ensure that the current code checks for all numa node ids (.size = nr_node_ids). Let's still keep max_threads as N_MEMORY, so that it can distributes all nr_node_ids among the these many no. threads. e.g. qemu cmdline ======================== numa_cmd="-numa node,nodeid=1,memdev=mem1,cpus=2-3 -numa node,nodeid=0,cpus=0-1 -numa dist,src=0,dst=1,val=20" mem_cmd="-object memory-backend-ram,id=mem1,size=16G" w/o this patch for cmdline (default_hugepagesz=1GB hugepagesz=1GB hugepages=2): ========================== ~ # cat /proc/meminfo |grep -i huge AnonHugePages: 0 kB ShmemHugePages: 0 kB FileHugePages: 0 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 1048576 kB Hugetlb: 0 kB with this patch for cmdline (default_hugepagesz=1GB hugepagesz=1GB hugepages=2): =========================== ~ # cat /proc/meminfo |grep -i huge AnonHugePages: 0 kB ShmemHugePages: 0 kB FileHugePages: 0 kB HugePages_Total: 2 HugePages_Free: 2 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 1048576 kB Hugetlb: 2097152 kB Link: https://lkml.kernel.org/r/f8d8dad3a5471d284f54185f65d575a6aaab692b.1736592534.git.ritesh.list@gmail.com Fixes: b78b27d02930 ("hugetlb: parallelize 1G hugetlb initialization") Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reported-by: Pavithra Prakash <pavrampu@linux.ibm.com> Suggested-by: Muchun Song <muchun.song@linux.dev> Tested-by: Sourabh Jain <sourabhjain@linux.ibm.com> Reviewed-by: Luiz Capitulino <luizcap@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Gang Li <gang.li@linux.dev> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01mm: gup: fix infinite loop within __get_longterm_lockedZhaoyang Huang
We can run into an infinite loop in __get_longterm_locked() when collect_longterm_unpinnable_folios() finds only folios that are isolated from the LRU or were never added to the LRU. This can happen when all folios to be pinned are never added to the LRU, for example when vm_ops->fault allocated pages using cma_alloc() and never added them to the LRU. Fix it by simply taking a look at the list in the single caller, to see if anything was added. [zhaoyang.huang@unisoc.com: move definition of local] Link: https://lkml.kernel.org/r/20250122012604.3654667-1-zhaoyang.huang@unisoc.com Link: https://lkml.kernel.org/r/20250121020159.3636477-1-zhaoyang.huang@unisoc.com Fixes: 67e139b02d99 ("mm/gup.c: refactor check_and_migrate_movable_pages()") Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Aijun Sun <aijun.sun@unisoc.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01mm, swap: fix reclaim offset calculation error during allocationKairui Song
There is a code error that will cause the swap entry allocator to reclaim and check the whole cluster with an unexpected tail offset instead of the part that needs to be reclaimed. This may cause corruption of the swap map, so fix it. Link: https://lkml.kernel.org/r/20250130115131.37777-1-ryncsn@gmail.com Fixes: 3b644773eefd ("mm, swap: reduce contention on device lock") Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Chris Li <chrisl@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01.mailmap: update email address for Christopher ObbardChristopher Obbard
Update my email address. Link: https://lkml.kernel.org/r/20250122-wip-obbardc-update-email-v2-1-12bde6b79ad0@linaro.org Signed-off-by: Christopher Obbard <christopher.obbard@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01kfence: skip __GFP_THISNODE allocations on NUMA systemsMarco Elver
On NUMA systems, __GFP_THISNODE indicates that an allocation _must_ be on a particular node, and failure to allocate on the desired node will result in a failed allocation. Skip __GFP_THISNODE allocations if we are running on a NUMA system, since KFENCE can't guarantee which node its pool pages are allocated on. Link: https://lkml.kernel.org/r/20250124120145.410066-1-elver@google.com Fixes: 236e9f153852 ("kfence: skip all GFP_ZONEMASK allocations") Signed-off-by: Marco Elver <elver@google.com> Reported-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Alexander Potapenko <glider@google.com> Cc: Chistoph Lameter <cl@linux.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01nilfs2: fix possible int overflows in nilfs_fiemap()Nikita Zhandarovich
Since nilfs_bmap_lookup_contig() in nilfs_fiemap() calculates its result by being prepared to go through potentially maxblocks == INT_MAX blocks, the value in n may experience an overflow caused by left shift of blkbits. While it is extremely unlikely to occur, play it safe and cast right hand expression to wider type to mitigate the issue. Found by Linux Verification Center (linuxtesting.org) with static analysis tool SVACE. Link: https://lkml.kernel.org/r/20250124222133.5323-1-konishi.ryusuke@gmail.com Fixes: 622daaff0a89 ("nilfs2: fiemap support") Signed-off-by: Nikita Zhandarovich <n.zhandarovich@fintech.ru> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01mm: compaction: use the proper flag to determine watermarksyangge
There are 4 NUMA nodes on my machine, and each NUMA node has 32GB of memory. I have configured 16GB of CMA memory on each NUMA node, and starting a 32GB virtual machine with device passthrough is extremely slow, taking almost an hour. Long term GUP cannot allocate memory from CMA area, so a maximum of 16 GB of no-CMA memory on a NUMA node can be used as virtual machine memory. There is 16GB of free CMA memory on a NUMA node, which is sufficient to pass the order-0 watermark check, causing the __compaction_suitable() function to consistently return true. For costly allocations, if the __compaction_suitable() function always returns true, it causes the __alloc_pages_slowpath() function to fail to exit at the appropriate point. This prevents timely fallback to allocating memory on other nodes, ultimately resulting in excessively long virtual machine startup times. Call trace: __alloc_pages_slowpath if (compact_result == COMPACT_SKIPPED || compact_result == COMPACT_DEFERRED) goto nopage; // should exit __alloc_pages_slowpath() from here We could use the real unmovable allocation context to have __zone_watermark_unusable_free() subtract CMA pages, and thus we won't pass the order-0 check anymore once the non-CMA part is exhausted. There is some risk that in some different scenario the compaction could in fact migrate pages from the exhausted non-CMA part of the zone to the CMA part and succeed, and we'll skip it instead. But only __GFP_NORETRY allocations should be affected in the immediate "goto nopage" when compaction is skipped, others will attempt with DEF_COMPACT_PRIORITY anyway and won't fail without trying to compact-migrate the non-CMA pageblocks into CMA pageblocks first, so it should be fine. After this fix, it only takes a few tens of seconds to start a 32GB virtual machine with device passthrough functionality. Link: https://lore.kernel.org/lkml/1736335854-548-1-git-send-email-yangge1116@126.com/ Link: https://lkml.kernel.org/r/1737788037-8439-1-git-send-email-yangge1116@126.com Signed-off-by: yangge <yangge1116@126.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Barry Song <21cnbao@gmail.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01kernel: be more careful about dup_mmap() failures and uprobe registeringLiam R. Howlett
If a memory allocation fails during dup_mmap(), the maple tree can be left in an unsafe state for other iterators besides the exit path. All the locks are dropped before the exit_mmap() call (in mm/mmap.c), but the incomplete mm_struct can be reached through (at least) the rmap finding the vmas which have a pointer back to the mm_struct. Up to this point, there have been no issues with being able to find an mm_struct that was only partially initialised. Syzbot was able to make the incomplete mm_struct fail with recent forking changes, so it has been proven unsafe to use the mm_struct that hasn't been initialised, as referenced in the link below. Although 8ac662f5da19f ("fork: avoid inappropriate uprobe access to invalid mm") fixed the uprobe access, it does not completely remove the race. This patch sets the MMF_OOM_SKIP to avoid the iteration of the vmas on the oom side (even though this is extremely unlikely to be selected as an oom victim in the race window), and sets MMF_UNSTABLE to avoid other potential users from using a partially initialised mm_struct. When registering vmas for uprobe, skip the vmas in an mm that is marked unstable. Modifying a vma in an unstable mm may cause issues if the mm isn't fully initialised. Link: https://lore.kernel.org/all/6756d273.050a0220.2477f.003d.GAE@google.com/ Link: https://lkml.kernel.org/r/20250127170221.1761366-1-Liam.Howlett@oracle.com Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01mm/fake-numa: handle cases with no SRAT infoBruno Faccini
Handle more gracefully cases where no SRAT information is available, like in VMs with no Numa support, and allow fake-numa configuration to complete successfully in these cases Link: https://lkml.kernel.org/r/20250127171623.1523171-1-bfaccini@nvidia.com Fixes: 63db8170bf34 (“mm/fake-numa: allow later numa node hotplug”) Signed-off-by: Bruno Faccini <bfaccini@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <hyeonggon.yoo@sk.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Len Brown <lenb@kernel.org> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01mm: kmemleak: fix upper boundary check for physical address objectsCatalin Marinas
Memblock allocations are registered by kmemleak separately, based on their physical address. During the scanning stage, it checks whether an object is within the min_low_pfn and max_low_pfn boundaries and ignores it otherwise. With the recent addition of __percpu pointer leak detection (commit 6c99d4eb7c5e ("kmemleak: enable tracking for percpu pointers")), kmemleak started reporting leaks in setup_zone_pageset() and setup_per_cpu_pageset(). These were caused by the node_data[0] object (initialised in alloc_node_data()) ending on the PFN_PHYS(max_low_pfn) boundary. The non-strict upper boundary check introduced by commit 84c326299191 ("mm: kmemleak: check physical address when scan") causes the pg_data_t object to be ignored (not scanned) and the __percpu pointers it contains to be reported as leaks. Make the max_low_pfn upper boundary check strict when deciding whether to ignore a physical address object and not scan it. Link: https://lkml.kernel.org/r/20250127184233.2974311-1-catalin.marinas@arm.com Fixes: 84c326299191 ("mm: kmemleak: check physical address when scan") Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Jakub Kicinski <kuba@kernel.org> Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Cc: Patrick Wang <patrick.wang.shcn@gmail.com> Cc: <stable@vger.kernel.org> [6.0.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01mailmap: add an entry for Hamza MahfoozHamza Mahfooz
Map my previous work email to my current one. Link: https://lkml.kernel.org/r/20250120205659.139027-1-hamzamahfooz@linux.microsoft.com Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Cc: David S. Miller <davem@davemloft.net> Cc: Hans verkuil <hverkuil@xs4all.nl> Cc: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01MAINTAINERS: mailmap: update Yosry Ahmed's email addressYosry Ahmed
Moving to a linux.dev email address. Link: https://lkml.kernel.org/r/20250123231344.817358-1-yosry.ahmed@linux.dev Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01scripts/gdb: fix aarch64 userspace detection in get_current_taskJan Kiszka
At least recent gdb releases (seen with 14.2) return SP_EL0 as signed long which lets the right-shift always return 0. Link: https://lkml.kernel.org/r/dcd2fabc-9131-4b48-8419-6444e2d67454@siemens.com Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Cc: Barry Song <baohua@kernel.org> Cc: Kieran Bingham <kbingham@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01mm/vmscan: accumulate nr_demoted for accurate demotion statisticsLi Zhijian
In shrink_folio_list(), demote_folio_list() can be called 2 times. Currently stat->nr_demoted will only store the last nr_demoted( the later nr_demoted is always zero, the former nr_demoted will get lost), as a result number of demoted pages is not accurate. Accumulate the nr_demoted count across multiple calls to demote_folio_list(), ensuring accurate reporting of demotion statistics. [lizhijian@fujitsu.com: introduce local nr_demoted to fix nr_reclaimed double counting] Link: https://lkml.kernel.org/r/20250111015253.425693-1-lizhijian@fujitsu.com Link: https://lkml.kernel.org/r/20250110122133.423481-1-lizhijian@fujitsu.com Fixes: f77f0c751478 ("mm,memcg: provide per-cgroup counters for NUMA balancing operations") Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> Acked-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu> Tested-by: Donet Tom <donettom@linux.ibm.com> Reviewed-by: Donet Tom <donettom@linux.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01ocfs2: fix incorrect CPU endianness conversion causing mount failureHeming Zhao
Commit 23aab037106d ("ocfs2: fix UBSAN warning in ocfs2_verify_volume()") introduced a regression bug. The blksz_bits value is already converted to CPU endian in the previous code; therefore, the code shouldn't use le32_to_cpu() anymore. Link: https://lkml.kernel.org/r/20250121112204.12834-1-heming.zhao@suse.com Fixes: 23aab037106d ("ocfs2: fix UBSAN warning in ocfs2_verify_volume()") Signed-off-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01mm/zsmalloc: add __maybe_unused attribute for is_first_zpdesc()Hyeonggon Yoo
Commit c1b3bb73d55e ("mm/zsmalloc: use zpdesc in trylock_zspage()/lock_zspage()") introduces is_first_zpdesc() function. However, the function is only used when CONFIG_DEBUG_VM=y. When building with LLVM=1 and W=1 option, the following warning is generated: $ make -j12 W=1 LLVM=1 mm/zsmalloc.o mm/zsmalloc.c:455:20: error: function 'is_first_zpdesc' is not needed and will not be emitted [-Werror,-Wunneeded-internal-declaration] 455 | static inline bool is_first_zpdesc(struct zpdesc *zpdesc) | ^~~~~~~~~~~~~~~ 1 error generated. Fix the warning by adding __maybe_unused attribute to the function. No functional change intended. Link: https://lkml.kernel.org/r/20250127231631.4363-1-42.hyeyoo@gmail.com Fixes: c1b3bb73d55e ("mm/zsmalloc: use zpdesc in trylock_zspage()/lock_zspage()") Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202501240958.4ILzuBrH-lkp@intel.com/ Cc: Alex Shi <alexs@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01mm/vmscan: fix hard LOCKUP in function isolate_lru_foliosliuye
This fixes the following hard lockup in isolate_lru_folios() during memory reclaim. If the LRU mostly contains ineligible folios this may trigger watchdog. watchdog: Watchdog detected hard LOCKUP on cpu 173 RIP: 0010:native_queued_spin_lock_slowpath+0x255/0x2a0 Call Trace: _raw_spin_lock_irqsave+0x31/0x40 folio_lruvec_lock_irqsave+0x5f/0x90 folio_batch_move_lru+0x91/0x150 lru_add_drain_per_cpu+0x1c/0x40 process_one_work+0x17d/0x350 worker_thread+0x27b/0x3a0 kthread+0xe8/0x120 ret_from_fork+0x34/0x50 ret_from_fork_asm+0x1b/0x30 lruvec->lru_lock owner: PID: 2865 TASK: ffff888139214d40 CPU: 40 COMMAND: "kswapd0" #0 [fffffe0000945e60] crash_nmi_callback at ffffffffa567a555 #1 [fffffe0000945e68] nmi_handle at ffffffffa563b171 #2 [fffffe0000945eb0] default_do_nmi at ffffffffa6575920 #3 [fffffe0000945ed0] exc_nmi at ffffffffa6575af4 #4 [fffffe0000945ef0] end_repeat_nmi at ffffffffa6601dde [exception RIP: isolate_lru_folios+403] RIP: ffffffffa597df53 RSP: ffffc90006fb7c28 RFLAGS: 00000002 RAX: 0000000000000001 RBX: ffffc90006fb7c60 RCX: ffffea04a2196f88 RDX: ffffc90006fb7c60 RSI: ffffc90006fb7c60 RDI: ffffea04a2197048 RBP: ffff88812cbd3010 R8: ffffea04a2197008 R9: 0000000000000001 R10: 0000000000000000 R11: 0000000000000001 R12: ffffea04a2197008 R13: ffffea04a2197048 R14: ffffc90006fb7de8 R15: 0000000003e3e937 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 <NMI exception stack> #5 [ffffc90006fb7c28] isolate_lru_folios at ffffffffa597df53 #6 [ffffc90006fb7cf8] shrink_active_list at ffffffffa597f788 #7 [ffffc90006fb7da8] balance_pgdat at ffffffffa5986db0 #8 [ffffc90006fb7ec0] kswapd at ffffffffa5987354 #9 [ffffc90006fb7ef8] kthread at ffffffffa5748238 crash> Scenario: User processe are requesting a large amount of memory and keep page active. Then a module continuously requests memory from ZONE_DMA32 area. Memory reclaim will be triggered due to ZONE_DMA32 watermark alarm reached. However pages in the LRU(active_anon) list are mostly from the ZONE_NORMAL area. Reproduce: Terminal 1: Construct to continuously increase pages active(anon). mkdir /tmp/memory mount -t tmpfs -o size=1024000M tmpfs /tmp/memory dd if=/dev/zero of=/tmp/memory/block bs=4M tail /tmp/memory/block Terminal 2: vmstat -a 1 active will increase. procs ---memory--- ---swap-- ---io---- -system-- ---cpu--- ... r b swpd free inact active si so bi bo 1 0 0 1445623076 45898836 83646008 0 0 0 1 0 0 1445623076 43450228 86094616 0 0 0 1 0 0 1445623076 41003480 88541364 0 0 0 1 0 0 1445623076 38557088 90987756 0 0 0 1 0 0 1445623076 36109688 93435156 0 0 0 1 0 0 1445619552 33663256 95881632 0 0 0 1 0 0 1445619804 31217140 98327792 0 0 0 1 0 0 1445619804 28769988 100774944 0 0 0 1 0 0 1445619804 26322348 103222584 0 0 0 1 0 0 1445619804 23875592 105669340 0 0 0 cat /proc/meminfo | head Active(anon) increase. MemTotal: 1579941036 kB MemFree: 1445618500 kB MemAvailable: 1453013224 kB Buffers: 6516 kB Cached: 128653956 kB SwapCached: 0 kB Active: 118110812 kB Inactive: 11436620 kB Active(anon): 115345744 kB Inactive(anon): 945292 kB When the Active(anon) is 115345744 kB, insmod module triggers the ZONE_DMA32 watermark. perf record -e vmscan:mm_vmscan_lru_isolate -aR perf script isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=2 nr_skipped=2 nr_taken=0 lru=active_anon isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=0 nr_skipped=0 nr_taken=0 lru=active_anon isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=28835844 nr_skipped=28835844 nr_taken=0 lru=active_anon isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=28835844 nr_skipped=28835844 nr_taken=0 lru=active_anon isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=29 nr_skipped=29 nr_taken=0 lru=active_anon isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=0 nr_skipped=0 nr_taken=0 lru=active_anon See nr_scanned=28835844. 28835844 * 4k = 115343376KB approximately equal to 115345744 kB. If increase Active(anon) to 1000G then insmod module triggers the ZONE_DMA32 watermark. hard lockup will occur. In my device nr_scanned = 0000000003e3e937 when hard lockup. Convert to memory size 0x0000000003e3e937 * 4KB = 261072092 KB. [ffffc90006fb7c28] isolate_lru_folios at ffffffffa597df53 ffffc90006fb7c30: 0000000000000020 0000000000000000 ffffc90006fb7c40: ffffc90006fb7d40 ffff88812cbd3000 ffffc90006fb7c50: ffffc90006fb7d30 0000000106fb7de8 ffffc90006fb7c60: ffffea04a2197008 ffffea0006ed4a48 ffffc90006fb7c70: 0000000000000000 0000000000000000 ffffc90006fb7c80: 0000000000000000 0000000000000000 ffffc90006fb7c90: 0000000000000000 0000000000000000 ffffc90006fb7ca0: 0000000000000000 0000000003e3e937 ffffc90006fb7cb0: 0000000000000000 0000000000000000 ffffc90006fb7cc0: 8d7c0b56b7874b00 ffff88812cbd3000 About the Fixes: Why did it take eight years to be discovered? The problem requires the following conditions to occur: 1. The device memory should be large enough. 2. Pages in the LRU(active_anon) list are mostly from the ZONE_NORMAL area. 3. The memory in ZONE_DMA32 needs to reach the watermark. If the memory is not large enough, or if the usage design of ZONE_DMA32 area memory is reasonable, this problem is difficult to detect. notes: The problem is most likely to occur in ZONE_DMA32 and ZONE_NORMAL, but other suitable scenarios may also trigger the problem. Link: https://lkml.kernel.org/r/20241119060842.274072-1-liuye@kylinos.cn Fixes: b2e18757f2c9 ("mm, vmscan: begin reclaiming pages on a per-node basis") Signed-off-by: liuye <liuye@kylinos.cn> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Yang Shi <yang@os.amperecomputing.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01sh: boards: Use imply to enable hardware with complex dependenciesGeert Uytterhoeven
If CONFIG_I2C=n: WARNING: unmet direct dependencies detected for SND_SOC_AK4642 Depends on [n]: SOUND [=y] && SND [=y] && SND_SOC [=y] && I2C [=n] Selected by [y]: - SH_7724_SOLUTION_ENGINE [=y] && CPU_SUBTYPE_SH7724 [=y] && SND_SIMPLE_CARD [=y] WARNING: unmet direct dependencies detected for SND_SOC_DA7210 Depends on [n]: SOUND [=y] && SND [=y] && SND_SOC [=y] && SND_SOC_I2C_AND_SPI [=n] Selected by [y]: - SH_ECOVEC [=y] && CPU_SUBTYPE_SH7724 [=y] && SND_SIMPLE_CARD [=y] Fix this by replacing select by imply, instead of adding a dependency on I2C. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202501240836.OvXqmANX-lkp@intel.com/ Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Signed-off-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>