Age | Commit message (Collapse) | Author |
|
commit 2fbdb6d8e03b70668c0876e635506540ae92ab05 upstream.
On arm32, size_t is defined to be unsigned int, while PAGE_SIZE is
unsigned long. This hence triggers a compilation warning as min()
asserts the type of two operands to be equal. Casting PAGE_SIZE to size_t
solves this issue and works on other target architectures as well.
Compilation warning details:
kernel/trace/trace.c: In function 'tracing_splice_read_pipe':
./include/linux/minmax.h:20:28: warning: comparison of distinct pointer types lacks a cast
(!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
^
./include/linux/minmax.h:26:4: note: in expansion of macro '__typecheck'
(__typecheck(x, y) && __no_side_effects(x, y))
^~~~~~~~~~~
...
kernel/trace/trace.c:6771:8: note: in expansion of macro 'min'
min((size_t)trace_seq_used(&iter->seq),
^~~
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250526013731.1198030-1-pantaixi@huaweicloud.com
Fixes: f5178c41bb43 ("tracing: Fix oob write in trace_seq_to_buffer()")
Reviewed-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Pan Taixi <pantaixi@huaweicloud.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 196a062641fe68d9bfe0ad36b6cd7628c99ad22c ]
Binary printing functions are using printf() type of format, and compiler
is not happy about them as is:
kernel/trace/trace.c:3292:9: error: function ‘trace_vbprintk’ might be a candidate for ‘gnu_printf’ format attribute [-Werror=suggest-attribute=format]
kernel/trace/trace_seq.c:182:9: error: function ‘trace_seq_bprintf’ might be a candidate for ‘gnu_printf’ format attribute [-Werror=suggest-attribute=format]
Fix the compilation errors by adding __printf() attribute.
While at it, move existing __printf() attributes from the implementations
to the declarations. IT also fixes incorrect attribute parameters that are
used for trace_array_printk().
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20250321144822.324050-4-andriy.shevchenko@linux.intel.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit f5178c41bb43444a6008150fe6094497135d07cb upstream.
syzbot reported this bug:
==================================================================
BUG: KASAN: slab-out-of-bounds in trace_seq_to_buffer kernel/trace/trace.c:1830 [inline]
BUG: KASAN: slab-out-of-bounds in tracing_splice_read_pipe+0x6be/0xdd0 kernel/trace/trace.c:6822
Write of size 4507 at addr ffff888032b6b000 by task syz.2.320/7260
CPU: 1 UID: 0 PID: 7260 Comm: syz.2.320 Not tainted 6.15.0-rc1-syzkaller-00301-g3bde70a2c827 #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:408 [inline]
print_report+0xc3/0x670 mm/kasan/report.c:521
kasan_report+0xe0/0x110 mm/kasan/report.c:634
check_region_inline mm/kasan/generic.c:183 [inline]
kasan_check_range+0xef/0x1a0 mm/kasan/generic.c:189
__asan_memcpy+0x3c/0x60 mm/kasan/shadow.c:106
trace_seq_to_buffer kernel/trace/trace.c:1830 [inline]
tracing_splice_read_pipe+0x6be/0xdd0 kernel/trace/trace.c:6822
....
==================================================================
It has been reported that trace_seq_to_buffer() tries to copy more data
than PAGE_SIZE to buf. Therefore, to prevent this, we should use the
smaller of trace_seq_used(&iter->seq) and PAGE_SIZE as an argument.
Link: https://lore.kernel.org/20250422113026.13308-1-aha310510@gmail.com
Reported-by: syzbot+c8cd2d2c412b868263fb@syzkaller.appspotmail.com
Fixes: 3c56819b14b0 ("tracing: splice support for tracing_pipe")
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 22bec11a569983f39c6061cb82279e7de9e3bdfc upstream.
When the function tracing_set_tracer() switched over to using the guard()
infrastructure, it did not need to save the 'ret' variable and would just
return the value when an error arised, instead of setting ret and jumping
to an out label.
When CONFIG_TRACER_SNAPSHOT is enabled, it had code that expected the
"ret" variable to be initialized to zero and had set 'ret' while holding
an arch_spin_lock() (not used by guard), and then upon releasing the lock
it would check 'ret' and exit if set. But because ret was only set when an
error occurred while holding the locks, 'ret' would be used uninitialized
if there was no error. The code in the CONFIG_TRACER_SNAPSHOT block should
be self contain. Make sure 'ret' is also set when no error occurred.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250106111143.2f90ff65@gandalf.local.home
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202412271654.nJVBuwmF-lkp@intel.com/
Fixes: d33b10c0c73ad ("tracing: Switch trace.c code over to use guard()")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 60b8f711143de7cd9c0f55be0fe7eb94b19eb5c7 ]
Currently if __tracing_resize_ring_buffer() returns an error, the
tracing_resize_ringbuffer() returns -ENOMEM. But it may not be a memory
issue that caused the function to fail. If the ring buffer is memory
mapped, then the resizing of the ring buffer will be disabled. But if the
user tries to resize the buffer, it will get an -ENOMEM returned, which is
confusing because there is plenty of memory. The actual error returned was
-EBUSY, which would make much more sense to the user.
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250213134132.7e4505d7@gandalf.local.home
Fixes: 117c39200d9d7 ("ring-buffer: Introducing ring-buffer mapping functions")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit d33b10c0c73adca00f72bf4a153a07b7f5f34715 ]
There are several functions in trace.c that have "goto out;" or
equivalent on error in order to release locks or free values that were
allocated. This can be error prone or just simply make the code more
complex.
Switch every location that ends with unlocking a mutex or freeing on error
over to using the guard(mutex)() and __free() infrastructure to let the
compiler worry about releasing locks. This makes the code easier to read
and understand.
There's one place that should probably return an error but instead return
0. This does not change the return as the only changes are to do the
conversion without changing the logic. Fixing that location will have to
come later.
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/20241224221413.7b8c68c3@batman.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Stable-dep-of: 60b8f711143d ("tracing: Have the error of __tracing_resize_ring_buffer() passed to user")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 129fe718819cc5e24ea2f489db9ccd4371f0c6f6 upstream.
When trying to mmap a trace instance buffer that is attached to
reserve_mem, it would crash:
BUG: unable to handle page fault for address: ffffe97bd00025c8
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 2862f3067 P4D 2862f3067 PUD 0
Oops: Oops: 0000 [#1] PREEMPT_RT SMP PTI
CPU: 4 UID: 0 PID: 981 Comm: mmap-rb Not tainted 6.14.0-rc2-test-00003-g7f1a5e3fbf9e-dirty #233
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:validate_page_before_insert+0x5/0xb0
Code: e2 01 89 d0 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 <48> 8b 46 08 a8 01 75 67 66 90 48 89 f0 8b 50 34 85 d2 74 76 48 89
RSP: 0018:ffffb148c2f3f968 EFLAGS: 00010246
RAX: ffff9fa5d3322000 RBX: ffff9fa5ccff9c08 RCX: 00000000b879ed29
RDX: ffffe97bd00025c0 RSI: ffffe97bd00025c0 RDI: ffff9fa5ccff9c08
RBP: ffffb148c2f3f9f0 R08: 0000000000000004 R09: 0000000000000004
R10: 0000000000000000 R11: 0000000000000200 R12: 0000000000000000
R13: 00007f16a18d5000 R14: ffff9fa5c48db6a8 R15: 0000000000000000
FS: 00007f16a1b54740(0000) GS:ffff9fa73df00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffe97bd00025c8 CR3: 00000001048c6006 CR4: 0000000000172ef0
Call Trace:
<TASK>
? __die_body.cold+0x19/0x1f
? __die+0x2e/0x40
? page_fault_oops+0x157/0x2b0
? search_module_extables+0x53/0x80
? validate_page_before_insert+0x5/0xb0
? kernelmode_fixup_or_oops.isra.0+0x5f/0x70
? __bad_area_nosemaphore+0x16e/0x1b0
? bad_area_nosemaphore+0x16/0x20
? do_kern_addr_fault+0x77/0x90
? exc_page_fault+0x22b/0x230
? asm_exc_page_fault+0x2b/0x30
? validate_page_before_insert+0x5/0xb0
? vm_insert_pages+0x151/0x400
__rb_map_vma+0x21f/0x3f0
ring_buffer_map+0x21b/0x2f0
tracing_buffers_mmap+0x70/0xd0
__mmap_region+0x6f0/0xbd0
mmap_region+0x7f/0x130
do_mmap+0x475/0x610
vm_mmap_pgoff+0xf2/0x1d0
ksys_mmap_pgoff+0x166/0x200
__x64_sys_mmap+0x37/0x50
x64_sys_call+0x1670/0x1d70
do_syscall_64+0xbb/0x1d0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
The reason was that the code that maps the ring buffer pages to user space
has:
page = virt_to_page((void *)cpu_buffer->subbuf_ids[s]);
And uses that in:
vm_insert_pages(vma, vma->vm_start, pages, &nr_pages);
But virt_to_page() does not work with vmap()'d memory which is what the
persistent ring buffer has. It is rather trivial to allow this, but for
now just disable mmap() of instances that have their ring buffer from the
reserve_mem option.
If an mmap() is performed on a persistent buffer it will return -ENODEV
just like it would if the .mmap field wasn't defined in the
file_operations structure.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250214115547.0d7287d3@gandalf.local.home
Fixes: 9b7bdf6f6ece6 ("tracing: Have trace_printk not use binary prints if boot buffer")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 98feccbf32cfdde8c722bc4587aaa60ee5ac33f0 upstream.
If a large count is provided, it will trigger a warning in bitmap_parse_user.
Also check zero for it.
Cc: stable@vger.kernel.org
Fixes: 9e01c1b74c953 ("cpumask: convert kernel trace functions")
Link: https://lore.kernel.org/20241216073238.2573704-1-lizhi.xu@windriver.com
Reported-by: syzbot+0aecfd34fb878546f3fd@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=0aecfd34fb878546f3fd
Tested-by: syzbot+0aecfd34fb878546f3fd@syzkaller.appspotmail.com
Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit afd2627f727b89496d79a6b934a025fc916d4ded upstream.
The TP_printk() portion of a trace event is executed at the time a event
is read from the trace. This can happen seconds, minutes, hours, days,
months, years possibly later since the event was recorded. If the print
format contains a dereference to a string via "%s", and that string was
allocated, there's a chance that string could be freed before it is read
by the trace file.
To protect against such bugs, there are two functions that verify the
event. The first one is test_event_printk(), which is called when the
event is created. It reads the TP_printk() format as well as its arguments
to make sure nothing may be dereferencing a pointer that was not copied
into the ring buffer along with the event. If it is, it will trigger a
WARN_ON().
For strings that use "%s", it is not so easy. The string may not reside in
the ring buffer but may still be valid. Strings that are static and part
of the kernel proper which will not be freed for the life of the running
system, are safe to dereference. But to know if it is a pointer to a
static string or to something on the heap can not be determined until the
event is triggered.
This brings us to the second function that tests for the bad dereferencing
of strings, trace_check_vprintf(). It would walk through the printf format
looking for "%s", and when it finds it, it would validate that the pointer
is safe to read. If not, it would produces a WARN_ON() as well and write
into the ring buffer "[UNSAFE-MEMORY]".
The problem with this is how it used va_list to have vsnprintf() handle
all the cases that it didn't need to check. Instead of re-implementing
vsnprintf(), it would make a copy of the format up to the %s part, and
call vsnprintf() with the current va_list ap variable, where the ap would
then be ready to point at the string in question.
For architectures that passed va_list by reference this was possible. For
architectures that passed it by copy it was not. A test_can_verify()
function was used to differentiate between the two, and if it wasn't
possible, it would disable it.
Even for architectures where this was feasible, it was a stretch to rely
on such a method that is undocumented, and could cause issues later on
with new optimizations of the compiler.
Instead, the first function test_event_printk() was updated to look at
"%s" as well. If the "%s" argument is a pointer outside the event in the
ring buffer, it would find the field type of the event that is the problem
and mark the structure with a new flag called "needs_test". The event
itself will be marked by TRACE_EVENT_FL_TEST_STR to let it be known that
this event has a field that needs to be verified before the event can be
printed using the printf format.
When the event fields are created from the field type structure, the
fields would copy the field type's "needs_test" value.
Finally, before being printed, a new function ignore_event() is called
which will check if the event has the TEST_STR flag set (if not, it
returns false). If the flag is set, it then iterates through the events
fields looking for the ones that have the "needs_test" flag set.
Then it uses the offset field from the field structure to find the pointer
in the ring buffer event. It runs the tests to make sure that pointer is
safe to print and if not, it triggers the WARN_ON() and also adds to the
trace output that the event in question has an unsafe memory access.
The ignore_event() makes the trace_check_vprintf() obsolete so it is
removed.
Link: https://lore.kernel.org/all/CAHk-=wh3uOnqnZPpR0PeLZZtyWbZLboZ7cHLCKRWsocvs9Y7hQ@mail.gmail.com/
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20241217024720.848621576@goodmis.org
Fixes: 5013f454a352c ("tracing: Add check of trace event print fmts for dereferencing pointers")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 8cd63406d08110c8098e1efda8aef7ddab4db348 upstream.
The TP_printk() of a TRACE_EVENT() is a generic printf format that any
developer can create for their event. It may include pointers to strings
and such. A boot mapped buffer may contain data from a previous kernel
where the strings addresses are different.
One solution is to copy the event content and update the pointers by the
recorded delta, but a simpler solution (for now) is to just use the
print_fields() function to print these events. The print_fields() function
just iterates the fields and prints them according to what type they are,
and ignores the TP_printk() format from the event itself.
To understand the difference, when printing via TP_printk() the output
looks like this:
4582.696626: kmem_cache_alloc: call_site=getname_flags+0x47/0x1f0 ptr=00000000e70e10e0 bytes_req=4096 bytes_alloc=4096 gfp_flags=GFP_KERNEL node=-1 accounted=false
4582.696629: kmem_cache_alloc: call_site=alloc_empty_file+0x6b/0x110 ptr=0000000095808002 bytes_req=360 bytes_alloc=384 gfp_flags=GFP_KERNEL node=-1 accounted=false
4582.696630: kmem_cache_alloc: call_site=security_file_alloc+0x24/0x100 ptr=00000000576339c3 bytes_req=16 bytes_alloc=16 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1 accounted=false
4582.696653: kmem_cache_free: call_site=do_sys_openat2+0xa7/0xd0 ptr=00000000e70e10e0 name=names_cache
But when printing via print_fields() (echo 1 > /sys/kernel/tracing/options/fields)
the same event output looks like this:
4582.696626: kmem_cache_alloc: call_site=0xffffffff92d10d97 (-1831793257) ptr=0xffff9e0e8571e000 (-107689771147264) bytes_req=0x1000 (4096) bytes_alloc=0x1000 (4096) gfp_flags=0xcc0 (3264) node=0xffffffff (-1) accounted=(0)
4582.696629: kmem_cache_alloc: call_site=0xffffffff92d0250b (-1831852789) ptr=0xffff9e0e8577f800 (-107689770747904) bytes_req=0x168 (360) bytes_alloc=0x180 (384) gfp_flags=0xcc0 (3264) node=0xffffffff (-1) accounted=(0)
4582.696630: kmem_cache_alloc: call_site=0xffffffff92efca74 (-1829778828) ptr=0xffff9e0e8d35d3b0 (-107689640864848) bytes_req=0x10 (16) bytes_alloc=0x10 (16) gfp_flags=0xdc0 (3520) node=0xffffffff (-1) accounted=(0)
4582.696653: kmem_cache_free: call_site=0xffffffff92cfbea7 (-1831879001) ptr=0xffff9e0e8571e000 (-107689771147264) name=names_cache
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20241218141507.28389a1d@gandalf.local.home
Fixes: 07714b4bb3f98 ("tracing: Handle old buffer mappings for event strings and functions")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 6ce5a6f0a07d37cc377df08a8d8a9c283420f323 ]
The issue that unrelated function name is shown on stack trace like
following even though it should be trampoline code address is caused by
the creation of trampoline code in the area where .init.text section
of module was freed after module is loaded.
bash-1344 [002] ..... 43.644608: <stack trace>
=> (MODULE INIT FUNCTION)
=> vfs_write
=> ksys_write
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
To resolve this, when function address of stack trace entry is in
trampoline, output without looking up symbol name.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241021071454.34610-2-tatsuya.s2862@gmail.com
Signed-off-by: Tatsuya S <tatsuya.s2862@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull ring buffer fixes from Steven Rostedt:
- Revert: "ring-buffer: Do not have boot mapped buffers hook to CPU
hotplug"
A crash that happened on cpu hotplug was actually caused by the
incorrect ref counting that was fixed by commit 2cf9733891a4
("ring-buffer: Fix refcount setting of boot mapped buffers"). The
removal of calling cpu hotplug callbacks on memory mapped buffers was
not an issue even though the tests at the time pointed toward it. But
in fact, there's a check in that code that tests to see if the
buffers are already allocated or not, and will not allocate them
again if they are. Not calling the cpu hotplug callbacks ended up not
initializing the non boot CPU buffers.
Simply remove that change.
- Clear all CPU buffers when starting tracing in a boot mapped buffer
To properly process events from a previous boot, the address space
needs to be accounted for due to KASLR and the events in the buffer
are updated accordingly when read. This also requires that when the
buffer has tracing enabled again in the current boot that the buffers
are reset so that events from the previous boot do not interact with
the events of the current boot and cause confusing due to not having
the proper meta data.
It was found that if a CPU is taken offline, that its per CPU buffer
is not reset when tracing starts. This allows for events to be from
both the previous boot and the current boot to be in the buffer at
the same time. Clear all CPU buffers when tracing is started in a
boot mapped buffer.
* tag 'trace-ringbuffer-v6.12-rc7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/ring-buffer: Clear all memory mapped CPU ring buffers on first recording
Revert: "ring-buffer: Do not have boot mapped buffers hook to CPU hotplug"
|
|
The events of a memory mapped ring buffer from the previous boot should
not be mixed in with events from the current boot. There's meta data that
is used to handle KASLR so that function names can be shown properly.
Also, since the timestamps of the previous boot have no meaning to the
timestamps of the current boot, having them intermingled in a buffer can
also cause confusion because there could possibly be events in the future.
When a trace is activated the meta data is reset so that the pointers of
are now processed for the new address space. The trace buffers are reset
when tracing starts for the first time. The problem here is that the reset
only happens on online CPUs. If a CPU is offline, it does not get reset.
To demonstrate the issue, a previous boot had tracing enabled in the boot
mapped ring buffer on reboot. On the following boot, tracing has not been
started yet so the function trace from the previous boot is still visible.
# trace-cmd show -B boot_mapped -c 3 | tail
<idle>-0 [003] d.h2. 156.462395: __rcu_read_lock <-cpu_emergency_disable_virtualization
<idle>-0 [003] d.h2. 156.462396: vmx_emergency_disable_virtualization_cpu <-cpu_emergency_disable_virtualization
<idle>-0 [003] d.h2. 156.462396: __rcu_read_unlock <-__sysvec_reboot
<idle>-0 [003] d.h2. 156.462397: stop_this_cpu <-__sysvec_reboot
<idle>-0 [003] d.h2. 156.462397: set_cpu_online <-stop_this_cpu
<idle>-0 [003] d.h2. 156.462397: disable_local_APIC <-stop_this_cpu
<idle>-0 [003] d.h2. 156.462398: clear_local_APIC <-disable_local_APIC
<idle>-0 [003] d.h2. 156.462574: mcheck_cpu_clear <-stop_this_cpu
<idle>-0 [003] d.h2. 156.462575: mce_intel_feature_clear <-stop_this_cpu
<idle>-0 [003] d.h2. 156.462575: lmce_supported <-mce_intel_feature_clear
Now, if CPU 3 is taken offline, and tracing is started on the memory
mapped ring buffer, the events from the previous boot in the CPU 3 ring
buffer is not reset. Now those events are using the meta data from the
current boot and produces just hex values.
# echo 0 > /sys/devices/system/cpu/cpu3/online
# trace-cmd start -B boot_mapped -p function
# trace-cmd show -B boot_mapped -c 3 | tail
<idle>-0 [003] d.h2. 156.462395: 0xffffffff9a1e3194 <-0xffffffff9a0f655e
<idle>-0 [003] d.h2. 156.462396: 0xffffffff9a0a1d24 <-0xffffffff9a0f656f
<idle>-0 [003] d.h2. 156.462396: 0xffffffff9a1e6bc4 <-0xffffffff9a0f7323
<idle>-0 [003] d.h2. 156.462397: 0xffffffff9a0d12b4 <-0xffffffff9a0f732a
<idle>-0 [003] d.h2. 156.462397: 0xffffffff9a1458d4 <-0xffffffff9a0d12e2
<idle>-0 [003] d.h2. 156.462397: 0xffffffff9a0faed4 <-0xffffffff9a0d12e7
<idle>-0 [003] d.h2. 156.462398: 0xffffffff9a0faaf4 <-0xffffffff9a0faef2
<idle>-0 [003] d.h2. 156.462574: 0xffffffff9a0e3444 <-0xffffffff9a0d12ef
<idle>-0 [003] d.h2. 156.462575: 0xffffffff9a0e4964 <-0xffffffff9a0d12ef
<idle>-0 [003] d.h2. 156.462575: 0xffffffff9a0e3fb0 <-0xffffffff9a0e496f
Reset all CPUs when starting a boot mapped ring buffer for the first time,
and not just the online CPUs.
Fixes: 7a1d1e4b9639f ("tracing/ring-buffer: Add last_boot_info file to boot instance")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Commit ee7f3666995d ("tracefs: Have new files inherit the ownership of
their parent") and commit 48b27b6b5191 ("tracefs: Set all files to the
same group ownership as the mount option") introduced a new gid mount
option that allows specifying a group to apply to all entries in tracefs.
Document this in the tracing readme.
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Ali Zahraee <ahzahraee@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/20241030171928.4168869-3-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
A ring buffer which has its buffered mapped at boot up to fixed memory
should not be freed. Other buffers can be. The ref counting setup was
wrong for both. It made the not mapped buffers ref count have zero, and the
boot mapped buffer a ref count of 1. But an normally allocated buffer
should be 1, where it can be removed.
Keep the ref count of a normal boot buffer with its setup ref count (do
not decrement it), and increment the fixed memory boot mapped buffer's ref
count.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241011165224.33dd2624@gandalf.local.home
Fixes: e645535a954ad ("tracing: Add option to use memmapped memory for trace boot instance")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
When the tp_printk kernel command line is used, the trace events go
directly to printk(). It is still checked via the trace_check_vprintf()
function to make sure the pointers of the trace event are legit.
The addition of reading buffers from previous boots required adding a
delta between the addresses of the previous boot and the current boot so
that the pointers in the old buffer can still be used. But this required
adding a trace_array pointer to acquire the delta offsets.
The tp_printk code does not provide a trace_array (tr) pointer, so when
the offsets were examined, a NULL pointer dereference happened and the
kernel crashed.
If the trace_array does not exist, just default the delta offsets to zero,
as that also means the trace event is not being read from a previous boot.
Link: https://lore.kernel.org/all/Zv3z5UsG_jsO9_Tb@aschofie-mobl2.lan/
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241003104925.4e1b1fd9@gandalf.local.home
Fixes: 07714b4bb3f98 ("tracing: Handle old buffer mappings for event strings and functions")
Reported-by: Alison Schofield <alison.schofield@intel.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
no_llseek had been defined to NULL two years ago, in commit 868941b14441
("fs: remove no_llseek")
To quote that commit,
At -rc1 we'll need do a mechanical removal of no_llseek -
git grep -l -w no_llseek | grep -v porting.rst | while read i; do
sed -i '/\<no_llseek\>/d' $i
done
would do it.
Unfortunately, that hadn't been done. Linus, could you do that now, so
that we could finally put that thing to rest? All instances are of the
form
.llseek = no_llseek,
so it's obviously safe.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull ring-buffer updates from Steven Rostedt:
- tracing/ring-buffer: persistent buffer across reboots
This allows for the tracing instance ring buffer to stay persistent
across reboots. The way this is done is by adding to the kernel
command line:
trace_instance=boot_map@0x285400000:12M
This will reserve 12 megabytes at the address 0x285400000, and then
map the tracing instance "boot_map" ring buffer to that memory. This
will appear as a normal instance in the tracefs system:
/sys/kernel/tracing/instances/boot_map
A user could enable tracing in that instance, and on reboot or kernel
crash, if the memory is not wiped by the firmware, it will recreate
the trace in that instance. For example, if one was debugging a
shutdown of a kernel reboot:
# cd /sys/kernel/tracing
# echo function > instances/boot_map/current_tracer
# reboot
[..]
# cd /sys/kernel/tracing
# tail instances/boot_map/trace
swapper/0-1 [000] d..1. 164.549800: restore_boot_irq_mode <-native_machine_shutdown
swapper/0-1 [000] d..1. 164.549801: native_restore_boot_irq_mode <-native_machine_shutdown
swapper/0-1 [000] d..1. 164.549802: disconnect_bsp_APIC <-native_machine_shutdown
swapper/0-1 [000] d..1. 164.549811: hpet_disable <-native_machine_shutdown
swapper/0-1 [000] d..1. 164.549812: iommu_shutdown_noop <-native_machine_restart
swapper/0-1 [000] d..1. 164.549813: native_machine_emergency_restart <-__do_sys_reboot
swapper/0-1 [000] d..1. 164.549813: tboot_shutdown <-native_machine_emergency_restart
swapper/0-1 [000] d..1. 164.549820: acpi_reboot <-native_machine_emergency_restart
swapper/0-1 [000] d..1. 164.549821: acpi_reset <-acpi_reboot
swapper/0-1 [000] d..1. 164.549822: acpi_os_write_port <-acpi_reboot
On reboot, the buffer is examined to make sure it is valid. The
validation check even steps through every event to make sure the meta
data of the event is correct. If any test fails, it will simply reset
the buffer, and the buffer will be empty on boot.
- Allow the tracing persistent boot buffer to use the "reserve_mem"
option
Instead of having the admin find a physical address to store the
persistent buffer, which can be very tedious if they have to
administrate several different machines, allow them to use the
"reserve_mem" option that will find a location for them. It is not as
reliable because of KASLR, as the loading of the kernel in different
locations can cause the memory allocated to be inconsistent. Booting
with "nokaslr" can make reserve_mem more reliable.
- Have function graph tracer handle offsets from a previous boot.
The ring buffer output from a previous boot may have different
addresses due to kaslr. Have the function graph tracer handle these
by using the delta from the previous boot to the new boot address
space.
- Only reset the saved meta offset when the buffer is started or reset
In the persistent memory meta data, it holds the previous address
space information, so that it can calculate the delta to have
function tracing work. But this gets updated after being read to hold
the new address space. But if the buffer isn't used for that boot, on
reboot, the delta is now calculated from the previous boot and not
the boot that holds the data in the ring buffer. This causes the
functions not to be shown. Do not save the address space information
of the current kernel until it is being recorded.
- Add a magic variable to test the valid meta data
Add a magic variable in the meta data that can also be used for
validation. The validator of the previous buffer doesn't need this
magic data, but it can be used if the meta data is changed by a new
kernel, which may have the same format that passes the validator but
is used differently. This magic number can also be used as a
"versioning" of the meta data.
- Align user space mapped ring buffer sub buffers to improve TLB
entries
Linus mentioned that the mapped ring buffer sub buffers were
misaligned between the meta page and the sub-buffers, so that if the
sub-buffers were bigger than PAGE_SIZE, it wouldn't allow the TLB to
use bigger entries.
- Add new kernel command line "traceoff" to disable tracing on boot for
instances
If tracing is enabled for a boot instance, there needs a way to be
able to disable it on boot so that new events do not get entered into
the ring buffer and be mixed with events from a previous boot, as
that can be confusing.
- Allow trace_printk() to go to other instances
Currently, trace_printk() can only go to the top level instance. When
debugging with a persistent buffer, it is really useful to be able to
add trace_printk() to go to that buffer, so that you have access to
them after a crash.
- Do not use "bin_printk()" for traces to a boot instance
The bin_printk() saves only a pointer to the printk format in the
ring buffer, as the reader of the buffer can still have access to it.
But this is not the case if the buffer is from a previous boot. If
the trace_printk() is going to a "persistent" buffer, it will use the
slower version that writes the printk format into the buffer.
- Add command line option to allow trace_printk() to go to an instance
Allow the kernel command line to define which instance the
trace_printk() goes to, instead of forcing the admin to set it for
every boot via the tracefs options.
- Start a document that explains how to use tracefs to debug the kernel
- Add some more kernel selftests to test user mapped ring buffer
* tag 'trace-ring-buffer-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (28 commits)
selftests/ring-buffer: Handle meta-page bigger than the system
selftests/ring-buffer: Verify the entire meta-page padding
tracing/Documentation: Start a document on how to debug with tracing
tracing: Add option to set an instance to be the trace_printk destination
tracing: Have trace_printk not use binary prints if boot buffer
tracing: Allow trace_printk() to go to other instance buffers
tracing: Add "traceoff" flag to boot time tracing instances
ring-buffer: Align meta-page to sub-buffers for improved TLB usage
ring-buffer: Add magic and struct size to boot up meta data
ring-buffer: Don't reset persistent ring-buffer meta saved addresses
tracing/fgraph: Have fgraph handle previous boot function addresses
tracing: Allow boot instances to use reserve_mem boot memory
tracing: Fix ifdef of snapshots to not prevent last_boot_info file
ring-buffer: Use vma_pages() helper function
tracing: Fix NULL vs IS_ERR() check in enable_instances()
tracing: Add last boot delta offset for stack traces
tracing: Update function tracing output for previous boot buffer
tracing: Handle old buffer mappings for event strings and functions
tracing/ring-buffer: Add last_boot_info file to boot instance
ring-buffer: Save text and data locations in mapped meta data
...
|
|
A helper function defined but not used. This, in particular,
prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y:
kernel/trace/trace.c:2229:19: error: unused function 'run_tracer_selftest' [-Werror,-Wunused-function]
2229 | static inline int run_tracer_selftest(struct tracer *type)
| ^~~~~~~~~~~~~~~~~~~
Fix this by dropping unused functions.
See also commit 6863f5643dd7 ("kbuild: allow Clang to find unused static
inline functions for W=1 build").
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Bill Wendling <morbo@google.com>
Cc: Justin Stitt <justinstitt@google.com>
Link: https://lore.kernel.org/20240909105314.928302-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
In __tracing_open(), when max latency tracers took place on the cpu,
the time start of its buffer would be updated, then event entries with
timestamps being earlier than start of the buffer would be skipped
(see tracing_iter_reset()).
Softlockup will occur if the kernel is non-preemptible and too many
entries were skipped in the loop that reset every cpu buffer, so add
cond_resched() to avoid it.
Cc: stable@vger.kernel.org
Fixes: 2f26ebd549b9a ("tracing: use timestamp to determine start of latency traces")
Link: https://lore.kernel.org/20240827124654.3817443-1-zhengyejian@huaweicloud.com
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Zheng Yejian <zhengyejian@huaweicloud.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Add a option "trace_printk_dest" that will make the tracing instance the
location that trace_printk() will go to. This is useful if the
trace_printk or one of the top level tracers is too noisy and there's a
need to separate the two. Then an instance can be created, the
trace_printk can be set to go there instead, where it will not be lost in
the noise of the top level tracer.
Note, only one instance can be the destination of trace_printk at a time.
If an instance sets this flag, the instance that had it set will have it
cleared. There is always one instance that has this set. By default, that
is the top instance. This flag cannot be cleared from the top instance.
Doing so will result in an -EINVAL. The only way this flag can be cleared
from the top instance is by another instance setting it.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineeth Pillai <vineeth@bitbyteword.org>
Cc: Beau Belgrave <beaub@linux.microsoft.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Ross Zwisler <zwisler@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Alexander Aring <aahringo@redhat.com>
Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Jonathan Corbet" <corbet@lwn.net>
Link: https://lore.kernel.org/20240823014019.545459018@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
If the persistent boot mapped ring buffer is used for trace_printk(),
force it to not use the binary versions. trace_printk() by default uses
bin_printf() that only saves the pointer to the format and not the format
itself inside the ring buffer. But for a persistent buffer that is read
after reboot, the pointers to the format strings may not be the same, or
worse, not even exist! Instead, just force the more robust, but slower,
version that does the formatting before saving into the ring buffer.
The boot mapped buffer can now be used for trace_printk and friends!
Using the trace_printk() and the persistent buffer was used to debug the
issue with the osnoise tracer:
Link: https://lore.kernel.org/all/20240822103443.6a6ae051@gandalf.local.home/
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineeth Pillai <vineeth@bitbyteword.org>
Cc: Beau Belgrave <beaub@linux.microsoft.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Ross Zwisler <zwisler@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Alexander Aring <aahringo@redhat.com>
Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Jonathan Corbet" <corbet@lwn.net>
Link: https://lore.kernel.org/20240823014019.386925800@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Currently, trace_printk() just goes to the top level ring buffer. But
there may be times that it should go to one of the instances created by
the kernel command line.
Add a new trace_instance flag: traceprintk (also can use "printk" or
"trace_printk" as people tend to forget the actual flag name).
trace_instance=foo^traceprintk
Will assign the trace_printk to this buffer at boot up.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineeth Pillai <vineeth@bitbyteword.org>
Cc: Beau Belgrave <beaub@linux.microsoft.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Ross Zwisler <zwisler@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Alexander Aring <aahringo@redhat.com>
Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Jonathan Corbet" <corbet@lwn.net>
Link: https://lore.kernel.org/20240823014019.226694946@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Add a "flags" delimiter (^) to the "trace_instance" kernel command line
parameter, and add the "traceoff" flag. The format is:
trace_instance=<name>[^<flag1>[^<flag2>]][@<memory>][,<events>]
The code allows for more than one flag to be added, but currently only
"traceoff" is done so.
The motivation for this change came from debugging with the persistent
ring buffer and having trace_printk() writing to it. The trace_printk
calls are always enabled, and the boot after the crash was having the
unwanted trace_printks from the current boot inject into the ring buffer
with the trace_printks of the crash kernel, making the output very
confusing.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineeth Pillai <vineeth@bitbyteword.org>
Cc: Beau Belgrave <beaub@linux.microsoft.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Ross Zwisler <zwisler@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Alexander Aring <aahringo@redhat.com>
Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Jonathan Corbet" <corbet@lwn.net>
Link: https://lore.kernel.org/20240823014019.053229958@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Allow boot instances to use memory reserved by the reserve_mem boot
option.
reserve_mem=12M:4096:trace trace_instance=boot_mapped@trace
The above will allocate 12 megs with 4096 alignment and label it "trace".
The second parameter will create a "boot_mapped" instance and use the
memory reserved and labeled as "trace" as the memory for the ring buffer.
That will create an instance called "boot_mapped":
/sys/kernel/tracing/instances/boot_mapped
Note, because the ring buffer is using a defined memory ranged, it will
act just like a memory mapped ring buffer. It will not have a snapshot
buffer, as it can't swap out the buffer. The snapshot files as well as any
tracers that uses a snapshot will not be present in the boot_mapped
instance.
Also note that reserve_mem is not reliable in acquiring the same physical
memory at each soft reboot. It is possible that KALSR could map the kernel
at the previous boot memory location forcing the reserve_mem to return a
different memory location. In this case, the previous ring buffer will be
lost.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ross Zwisler <zwisler@google.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20240815082811.669f7d8c@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The mapping of the ring buffer to memory allocated at boot up will also
expose a "last_boot_info" to help tooling to read the raw data from the
last boot. As instances that have their ring buffer mapped to fixed
memory cannot perform snapshots, they can either have the "snapshot" file
or the "last_boot_info" file, but not both.
The code that added the "last_boot_info" file failed to notice that the
"snapshot" creation was inside a "#ifdef CONFIG_TRACER_SNAPSHOT" and
incorrectly placed the creation of the "last_boot_info" file within the
ifdef block. Not only does it cause a warning when CONFIG_TRACER_SNAPSHOT
is not enabled, it also incorrectly prevents the file from appearing.
Link: https://lore.kernel.org/all/20240719102640.718554-1-arnd@kernel.org/
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reported-by: Arnd Bergmann <arnd@kernel.org>
Link: https://lore.kernel.org/20240719101312.3d4ac707@rorschach.local.home
Fixes: 7a1d1e4b9639 ("tracing/ring-buffer: Add last_boot_info file to boot instance")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The "reserve_mem" kernel command line parameter has been pulled into
v6.11. Merge the latest -rc3 to allow the persistent ring buffer memory to
be able to be mapped at the address specified by the "reserve_mem" command
line parameter.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
When running the following:
# cd /sys/kernel/tracing/
# echo 1 > events/sched/sched_waking/enable
# echo 1 > events/sched/sched_switch/enable
# echo 0 > tracing_on
# dd if=per_cpu/cpu0/trace_pipe_raw of=/tmp/raw0.dat
The dd task would get stuck in an infinite loop in the kernel. What would
happen is the following:
When ring_buffer_read_page() returns -1 (no data) then a check is made to
see if the buffer is empty (as happens when the page is not full), it will
call wait_on_pipe() to wait until the ring buffer has data. When it is it
will try again to read data (unless O_NONBLOCK is set).
The issue happens when there's a reader and the file descriptor is closed.
The wait_on_pipe() will return when that is the case. But this loop will
continue to try again and wait_on_pipe() will again return immediately and
the loop will continue and never stop.
Simply check if the file was closed before looping and exit out if it is.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20240808235730.78bf63e5@rorschach.local.home
Fixes: 2aa043a55b9a7 ("tracing/ring-buffer: Fix wait_on_pipe() race")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
const qualify the struct ctl_table argument in the proc_handler function
signatures. This is a prerequisite to moving the static ctl_table
structs into .rodata data which will ensure that proc_handler function
pointers cannot be modified.
This patch has been generated by the following coccinelle script:
```
virtual patch
@r1@
identifier ctl, write, buffer, lenp, ppos;
identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int write, void *buffer, size_t *lenp, loff_t *ppos);
@r2@
identifier func, ctl, write, buffer, lenp, ppos;
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int write, void *buffer, size_t *lenp, loff_t *ppos)
{ ... }
@r3@
identifier func;
@@
int func(
- struct ctl_table *
+ const struct ctl_table *
,int , void *, size_t *, loff_t *);
@r4@
identifier func, ctl;
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int , void *, size_t *, loff_t *);
@r5@
identifier func, write, buffer, lenp, ppos;
@@
int func(
- struct ctl_table *
+ const struct ctl_table *
,int write, void *buffer, size_t *lenp, loff_t *ppos);
```
* Code formatting was adjusted in xfs_sysctl.c to comply with code
conventions. The xfs_stats_clear_proc_handler,
xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where
adjusted.
* The ctl_table argument in proc_watchdog_common was const qualified.
This is called from a proc_handler itself and is calling back into
another proc_handler, making it necessary to change it as part of the
proc_handler migration.
Co-developed-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Co-developed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Joel Granados <j.granados@samsung.com>
|
|
The trace_array_create_systems() function returns error pointers, not
NULL. Fix the check to match.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: e645535a954a ("tracing: Add option to use memmapped memory for trace boot instance")
Link: https://lore.kernel.org/9b23ea03-d709-435f-a309-461c3d747457@moroto.mountain
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Use the saved text_delta and data_delta of a persistent memory mapped ring
buffer that was saved from a previous boot, and use the delta in the trace
event print output so that strings and functions show up normally.
That is, for an event like trace_kmalloc() that prints the callsite via
"%pS", if it used the address saved in the ring buffer it will not match
the function that was saved in the previous boot if the kernel remaps
itself between boots.
For RCU events that point to saved static strings where only the address
of the string is saved in the ring buffer, it too will be adjusted to
point to where the string is on the current boot.
Link: https://lkml.kernel.org/r/20240612232026.821020753@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineeth Pillai <vineeth@bitbyteword.org>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Beau Belgrave <beaub@linux.microsoft.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Ross Zwisler <zwisler@google.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
If an instance is mapped to memory on boot up, create a new file called
"last_boot_info" that will hold information that can be used to properly
parse the raw data in the ring buffer.
It will export the delta of the addresses for text and data from what it
was from the last boot. It does not expose actually addresses (unless you
knew what the actual address was from the last boot).
The output will look like:
# cat last_boot_info
text delta: -268435456
data delta: -268435456
The text and data are kept separate in case they are ever made different.
Link: https://lkml.kernel.org/r/20240612232026.658680738@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineeth Pillai <vineeth@bitbyteword.org>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Beau Belgrave <beaub@linux.microsoft.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Ross Zwisler <zwisler@google.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Add an option to the trace_instance kernel command line parameter that
allows it to use the reserved memory from memmap boot parameter.
memmap=12M$0x284500000 trace_instance=boot_mapped@0x284500000:12M
The above will reserves 12 megs at the physical address 0x284500000.
The second parameter will create a "boot_mapped" instance and use the
memory reserved as the memory for the ring buffer.
That will create an instance called "boot_mapped":
/sys/kernel/tracing/instances/boot_mapped
Note, because the ring buffer is using a defined memory ranged, it will
act just like a memory mapped ring buffer. It will not have a snapshot
buffer, as it can't swap out the buffer. The snapshot files as well as any
tracers that uses a snapshot will not be present in the boot_mapped
instance.
Link: https://lkml.kernel.org/r/20240612232026.329660169@goodmis.org
Cc: linux-mm@kvack.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineeth Pillai <vineeth@bitbyteword.org>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Beau Belgrave <beaub@linux.microsoft.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Ross Zwisler <zwisler@google.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Add a buffer_meta per-cpu file for the trace instance that is mapped to
boot memory. This shows the current meta-data and can be used by user
space tools to record off the current mappings to help reconstruct the
ring buffer after a reboot.
It does not expose any virtual addresses, just indexes into the sub-buffer
pages.
Link: https://lkml.kernel.org/r/20240612232025.854471446@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineeth Pillai <vineeth@bitbyteword.org>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Beau Belgrave <beaub@linux.microsoft.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Ross Zwisler <zwisler@google.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Allow for creating a new instance by passing in an address and size to map
the ring buffer for the instance to.
This will allow features like a pstore memory mapped region to be used for
an tracing instance ring buffer that can be retrieved from one boot to the
next.
Link: https://lkml.kernel.org/r/20240612232025.692086240@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineeth Pillai <vineeth@bitbyteword.org>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Beau Belgrave <beaub@linux.microsoft.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Ross Zwisler <zwisler@google.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing ring buffer updates from Steven Rostedt:
"Add ring_buffer memory mappings.
The tracing ring buffer was created based on being mostly used with
the splice system call. It is broken up into page ordered sub-buffers
and the reader swaps a new sub-buffer with an existing sub-buffer
that's part of the write buffer. It then has total access to the
swapped out sub-buffer and can do copyless movements of the memory
into other mediums (file system, network, etc).
The buffer is great for passing around the ring buffer contents in the
kernel, but is not so good for when the consumer is the user space
task itself.
A new interface is added that allows user space to memory map the ring
buffer. It will get all the write sub-buffers as well as reader
sub-buffer (that is not written to). It can send an ioctl to change
which sub-buffer is the new reader sub-buffer.
The ring buffer is read only to user space. It only needs to call the
ioctl when it is finished with a sub-buffer and needs a new sub-buffer
that the writer will not write over.
A self test program was also created for testing and can be used as an
example for the interface to user space. The libtracefs (external to
the kernel) also has code that interacts with this, although it is
disabled until the interface is in a official release. It can be
enabled by compiling the library with a special flag. This was used
for testing applications that perform better with the buffer being
mapped.
Memory mapped buffers have limitations. The main one is that it can
not be used with the snapshot logic. If the buffer is mapped,
snapshots will be disabled. If any logic is set to trigger snapshots
on a buffer, that buffer will not be allowed to be mapped"
* tag 'trace-ringbuffer-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ring-buffer: Add cast to unsigned long addr passed to virt_to_page()
ring-buffer: Have mmapped ring buffer keep track of missed events
ring-buffer/selftest: Add ring-buffer mapping test
Documentation: tracing: Add ring-buffer mapping
tracing: Allow user-space mapping of the ring-buffer
ring-buffer: Introducing ring-buffer mapping functions
ring-buffer: Allocate sub-buffers with __GFP_COMP
|
|
Currently, user-space extracts data from the ring-buffer via splice,
which is handy for storage or network sharing. However, due to splice
limitations, it is imposible to do real-time analysis without a copy.
A solution for that problem is to let the user-space map the ring-buffer
directly.
The mapping is exposed via the per-CPU file trace_pipe_raw. The first
element of the mapping is the meta-page. It is followed by each
subbuffer constituting the ring-buffer, ordered by their unique page ID:
* Meta-page -- include/uapi/linux/trace_mmap.h for a description
* Subbuf ID 0
* Subbuf ID 1
...
It is therefore easy to translate a subbuf ID into an offset in the
mapping:
reader_id = meta->reader->id;
reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;
When new data is available, the mapper must call a newly introduced ioctl:
TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to
point to the next reader containing unread data.
Mapping will prevent snapshot and buffer size modifications.
Link: https://lore.kernel.org/linux-trace-kernel/20240510140435.3550353-4-vdonnefort@google.com
CC: <linux-mm@kvack.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
As like '%pd' type, this patch supports print type '%pD' for print file's
name. For example "name=$arg1:%pD" casts the `$arg1` as (struct file*),
dereferences the "file.f_path.dentry.d_name.name" field and stores it to
"name" argument as a kernel string.
Here is an example:
[tracing]# echo 'p:testprobe vfs_read name=$arg1:%pD' > kprobe_event
[tracing]# echo 1 > events/kprobes/testprobe/enable
[tracing]# grep -q "1" events/kprobes/testprobe/enable
[tracing]# echo 0 > events/kprobes/testprobe/enable
[tracing]# grep "vfs_read" trace | grep "enable"
grep-15108 [003] ..... 5228.328609: testprobe: (vfs_read+0x4/0xbb0) name="enable"
Note that this expects the given argument (e.g. $arg1) is an address of struct
file. User must ensure it.
Link: https://lore.kernel.org/all/20240322064308.284457-3-yebin10@huawei.com/
[Masami: replaced "previous patch" with '%pd' type]
Signed-off-by: Ye Bin <yebin10@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
During fault locating, the file name needs to be printed based on the
dentry address. The offset needs to be calculated each time, which
is troublesome. Similar to printk, kprobe support print type '%pd' for
print dentry's name. For example "name=$arg1:%pd" casts the `$arg1`
as (struct dentry *), dereferences the "d_name.name" field and stores
it to "name" argument as a kernel string.
Here is an example:
[tracing]# echo 'p:testprobe dput name=$arg1:%pd' > kprobe_events
[tracing]# echo 1 > events/kprobes/testprobe/enable
[tracing]# grep -q "1" events/kprobes/testprobe/enable
[tracing]# echo 0 > events/kprobes/testprobe/enable
[tracing]# cat trace | grep "enable"
bash-14844 [002] ..... 16912.889543: testprobe: (dput+0x4/0x30) name="enable"
grep-15389 [003] ..... 16922.834182: testprobe: (dput+0x4/0x30) name="enable"
grep-15389 [003] ..... 16922.836103: testprobe: (dput+0x4/0x30) name="enable"
bash-14844 [001] ..... 16931.820909: testprobe: (dput+0x4/0x30) name="enable"
Note that this expects the given argument (e.g. $arg1) is an address of struct
dentry. User must ensure it.
Link: https://lore.kernel.org/all/20240322064308.284457-2-yebin10@huawei.com/
Signed-off-by: Ye Bin <yebin10@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
Currently ftrace only dumps the global trace buffer on an OOPs. For
debugging a production usecase, instance trace will be helpful to
check specific problems since global trace buffer may be used for
other purposes.
This patch extend the ftrace_dump_on_oops parameter to dump a specific
or multiple trace instances:
- ftrace_dump_on_oops=0: as before -- don't dump
- ftrace_dump_on_oops[=1]: as before -- dump the global trace buffer
on all CPUs
- ftrace_dump_on_oops=2 or =orig_cpu: as before -- dump the global
trace buffer on CPU that triggered the oops
- ftrace_dump_on_oops=<instance_name>: new behavior -- dump the
tracing instance matching <instance_name>
- ftrace_dump_on_oops[=2/orig_cpu],<instance1_name>[=2/orig_cpu],
<instrance2_name>[=2/orig_cpu]: new behavior -- dump the global trace
buffer and multiple instance buffer on all CPUs, or only dump on CPU
that triggered the oops if =2 or =orig_cpu is given
Also, the sysctl node can handle the input accordingly.
Link: https://lore.kernel.org/linux-trace-kernel/20240223083126.1817731-1-quic_hyiwei@quicinc.com
Cc: Ross Zwisler <zwisler@google.com>
Cc: <mhiramat@kernel.org>
Cc: <mark.rutland@arm.com>
Cc: <mcgrof@kernel.org>
Cc: <keescook@chromium.org>
Cc: <j.granados@samsung.com>
Cc: <mathieu.desnoyers@efficios.com>
Cc: <corbet@lwn.net>
Signed-off-by: Huang Yiwei <quic_hyiwei@quicinc.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Running the ftrace selftests caused the ring buffer mapping test to fail.
Investigating, I found that the snapshot counter would be incremented
every time a tracer that uses the snapshot is enabled even if the snapshot
was used by the previous tracer.
That is:
# cd /sys/kernel/tracing
# echo wakeup_rt > current_tracer
# echo wakeup_dl > current_tracer
# echo nop > current_tracer
would leave the snapshot counter at 1 and not zero. That's because the
enabling of wakeup_dl would increment the counter again but the setting
the tracer to nop would only decrement it once.
Do not arm the snapshot for a tracer if the previous tracer already had it
armed.
Link: https://lore.kernel.org/linux-trace-kernel/20240223013344.570525723@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Fixes: 16f7e48ffc53a ("tracing: Add snapshot refcount")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Instead of using UTS_RELEASE, use init_utsname()->release, which means that
we don't need to rebuild the code just for the git head commit changing.
Link: https://lore.kernel.org/linux-trace-kernel/20240222124639.65629-1-john.g.garry@oracle.com
Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
When a ring-buffer is memory mapped by user-space, no trace or
ring-buffer swap is possible. This means the snapshot feature is
mutually exclusive with the memory mapping. Having a refcount on
snapshot users will help to know if a mapping is possible or not.
Instead of relying on the global trace_types_lock, a new spinlock is
introduced to serialize accesses to trace_array->snapshot. This intends
to allow access to that variable in a context where the mmap lock is
already held.
Link: https://lore.kernel.org/linux-trace-kernel/20240220202310.2489614-4-vdonnefort@google.com
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The code that handles saved_cmdlines is split between the trace.c file and
the trace_sched_switch.c. There's some history to this. The
trace_sched_switch.c was originally created to handle the sched_switch
tracer that was deprecated due to sched_switch trace event making it
obsolete. But that file did not get deleted as it had some code to help
with saved_cmdlines. But trace.c has grown tremendously since then. Just
move all the saved_cmdlines code into trace_sched_switch.c as that's the
only reason that file still exists, and trace.c has gotten too big.
No functional changes.
Link: https://lore.kernel.org/linux-trace-kernel/20240220140703.497966629@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
In preparation of moving the saved_cmdlines logic out of trace.c and into
trace_sched_switch.c, replace the open coded manipulation of tgid_map in
set_tracer_flag() into a helper function trace_alloc_tgid_map() so that it
can be easily moved into trace_sched_switch.c without changing existing
functions in trace.c.
No functional changes.
Link: https://lore.kernel.org/linux-trace-kernel/20240220140703.338116216@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The saved_cmdlines have three arrays for mapping PIDs to COMMs:
- map_pid_to_cmdline[]
- map_cmdline_to_pid[]
- saved_cmdlines
The map_pid_to_cmdline[] is PID_MAX_DEFAULT in size and holds the index
into the other arrays. The map_cmdline_to_pid[] is a mapping back to the
full pid as it can be larger than PID_MAX_DEFAULT. And the
saved_cmdlines[] just holds the COMMs associated to the pids.
Currently the map_pid_to_cmdline[] and saved_cmdlines[] are allocated
together (in reality the saved_cmdlines is just in the memory of the
rounding of the allocation of the structure as it is always allocated in
powers of two). The map_cmdline_to_pid[] array is allocated separately.
Since the rounding to a power of two is rather large (it allows for 8000
elements in saved_cmdlines), also include the map_cmdline_to_pid[] array.
(This drops it to 6000 by default, which is still plenty for most use
cases). This saves even more memory as the map_cmdline_to_pid[] array
doesn't need to be allocated.
Link: https://lore.kernel.org/linux-trace-kernel/20240212174011.068211d9@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20240220140703.182330529@goodmis.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Mete Durlu <meted@linux.ibm.com>
Fixes: 44dc5c41b5b1 ("tracing: Fix wasted memory in saved_cmdlines logic")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:
- Do not update shortest_full in rb_watermark_hit() if the watermark is
hit. The shortest_full field was being updated regardless if the task
was going to wait or not. If the watermark is hit, then the task is
not going to wait, so do not update the shortest_full field (used by
the waker).
- Update shortest_full field before setting the full_waiters_pending
flag
In the poll logic, the full_waiters_pending flag was being set before
the shortest_full field was set. If the full_waiters_pending flag is
set, writers will check the shortest_full field which has the least
percentage of data that the ring buffer needs to be filled before
waking up. The writer will check shortest_full if
full_waiters_pending is set, and if the ring buffer percentage filled
is greater than shortest full, then it will call the irq_work to wake
up the waiters.
The problem was that the poll logic set the full_waiters_pending flag
before updating shortest_full, which when zero will always trigger
the writer to call the irq_work to wake up the waiters. The irq_work
will reset the shortest_full field back to zero as the woken waiters
is suppose to reset it.
- There's some optimized logic in the rb_watermark_hit() that is used
in ring_buffer_wait(). Use that helper function in the poll logic as
well.
- Restructure ring_buffer_wait() to use wait_event_interruptible()
The logic to wake up pending readers when the file descriptor is
closed is racy. Restructure ring_buffer_wait() to allow callers to
pass in conditions besides the ring buffer having enough data in it
by using wait_event_interruptible().
- Update the tracing_wait_on_pipe() to call ring_buffer_wait() with its
own conditions to exit the wait loop.
* tag 'trace-ring-buffer-v6.8-rc7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/ring-buffer: Fix wait_on_pipe() race
ring-buffer: Use wait_event_interruptible() in ring_buffer_wait()
ring-buffer: Reuse rb_watermark_hit() for the poll logic
ring-buffer: Fix full_waiters_pending in poll
ring-buffer: Do not set shortest_full when full target is hit
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes updates from Masami Hiramatsu:
"x86 kprobes:
- Use boolean for some function return instead of 0 and 1
- Prohibit probing on INT/UD. This prevents user to put kprobe on
INTn/INT1/INT3/INTO and UD0/UD1/UD2 because these are used for a
special purpose in the kernel
- Boost Grp instructions. Because a few percent of kernel
instructions are Grp 2/3/4/5 and those are safe to be executed
without ip register fixup, allow those to be boosted (direct
execution on the trampoline buffer with a JMP)
tracing:
- Add function argument access from return events (kretprobe and
fprobe). This allows user to compare how a data structure field is
changed after executing a function. With BTF, return event also
accepts function argument access by name.
- Fix a wrong comment (using "Kretprobe" in fprobe)
- Cleanup a big probe argument parser function into three parts, type
parser, post-processing function, and main parser
- Cleanup to set nr_args field when initializing trace_probe instead
of counting up it while parsing
- Cleanup a redundant #else block from tracefs/README source code
- Update selftests to check entry argument access from return probes
- Documentation update about entry argument access from return
probes"
* tag 'probes-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
Documentation: tracing: Add entry argument access at function exit
selftests/ftrace: Add test cases for entry args at function exit
tracing/probes: Support $argN in return probe (kprobe and fprobe)
tracing: Remove redundant #else block for BTF args from README
tracing/probes: cleanup: Set trace_probe::nr_args at trace_probe_init
tracing/probes: Cleanup probe argument parser
tracing/fprobe-event: cleanup: Fix a wrong comment in fprobe event
x86/kprobes: Boost more instructions from grp2/3/4/5
x86/kprobes: Prohibit kprobing on INT and UD
x86/kprobes: Refactor can_{probe,boost} return type to bool
|
|
When the trace_pipe_raw file is closed, there should be no new readers on
the file descriptor. This is mostly handled with the waking and wait_index
fields of the iterator. But there's still a slight race.
CPU 0 CPU 1
----- -----
wait_index++;
index = wait_index;
ring_buffer_wake_waiters();
wait_on_pipe()
ring_buffer_wait();
The ring_buffer_wait() will miss the wakeup from CPU 1. The problem is
that the ring_buffer_wait() needs the logic of:
prepare_to_wait();
if (!condition)
schedule();
Where the missing condition check is the iter->wait_index update.
Have the ring_buffer_wait() take a conditional callback function and a
data parameter that can be used within the wait_event_interruptible() of
the ring_buffer_wait() function.
In wait_on_pipe(), pass a condition function that will check if the
wait_index has been updated, if it has, it will return true to break out
of the wait_event_interruptible() loop.
Create a new field "closed" in the trace_iterator and set it in the
.flush() callback before calling ring_buffer_wake_waiters().
This will keep any new readers from waiting on a closed file descriptor.
Have the wait_on_pipe() condition callback also check the closed field.
Change the wait_index field of the trace_iterator to atomic_t. There's no
reason it needs to be 'long' and making it atomic and using
atomic_read_acquire() and atomic_fetch_inc_release() will provide the
necessary memory barriers.
Add a "woken" flag to tracing_buffers_splice_read() to exit the loop after
one more try to fetch data. That is, if it waited for data and something
woke it up, it should try to collect any new data and then exit back to
user space.
Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wgsNgewHFxZAJiAQznwPMqEtQmi1waeS2O1v6L4c_Um5A@mail.gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240312121703.557950713@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linke li <lilinke99@qq.com>
Cc: Rabin Vincent <rabin@rab.in>
Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the file")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The .release() function does not get called until all readers of a file
descriptor are finished.
If a thread is blocked on reading a file descriptor in ring_buffer_wait(),
and another thread closes the file descriptor, it will not wake up the
other thread as ring_buffer_wake_waiters() is called by .release(), and
that will not get called until the .read() is finished.
The issue originally showed up in trace-cmd, but the readers are actually
other processes with their own file descriptors. So calling close() would wake
up the other tasks because they are blocked on another descriptor then the
one that was closed(). But there's other wake ups that solve that issue.
When a thread is blocked on a read, it can still hang even when another
thread closed its descriptor.
This is what the .flush() callback is for. Have the .flush() wake up the
readers.
Link: https://lore.kernel.org/linux-trace-kernel/20240308202432.107909457@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linke li <lilinke99@qq.com>
Cc: Rabin Vincent <rabin@rab.in>
Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the file")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|