本文简单介绍 Ftrace 相关的内容, 然后给出几个实用的例子.
关键概念
- Ftrace 官方是首字母大写 Ftrace;
- Ftrace 的 Kernel 文档: https://www.kernel.org/doc/html/latest/trace/ftrace.html
- Ftrace 不是一个简单的工具/命令, 而是一个观测内核的框架. 它可以用来 debug/分析延迟/profiling 内核;
- 它的延迟分析能够分析在 interrupt 关闭/打开, 抢占关闭/打开的情况下的延迟;
- 内核提供了上百个事件(event)的观测, 通过配置 tracefs, 可以通过 Ftrace 看到这些 event 的情况;
kernel 支持的 tracer 以及配置 tracer
pi@raspberrypi:~/tmp $ sudo cat /sys/kernel/tracing/available_tracers
blk function_graph wakeup_dl wakeup_rt wakeup irqsoff function nop
sudo echo function_graph > /sys/kernel/tracing/current_tracer
tracefs
Ftrace 使用 tracefs 配置
和产生输出
, tracefs 是一个虚拟的文件系统, 用来与 kernel 提供的tracing 框架交互. 它能够动态的打开/关闭某个tracing feature. tracefs 提供一个层级路径结构, 不同的子目录对应不同的 tracing 事件. 它的主路径在 /sys/kernel/tracing
.
通过 tracefs 我们可以:
- 控制 tracing events: 打开或关闭一个/一组events. 如: function call, context switch, memory allocation
- 控制某些选项: 如 trace buffer size, trace data 格式, event handler 的行为
- 配置数据搜集文件: 如你可以配置trace data 从专门文件读,
trace
, trace_pipe
, trace_marker
. - 过滤: 如 过滤特定的线程, 进程, event type.
几个例子
确保 tracefs 被挂载
$ ls /sys/kernel/tracing
# 若没挂载, 使用下面命令挂载 tracefs.
$ sudo mount -t tracefs nodev /sys/kernel/tracing
trace function uptime_proc_show
的例子
# 查看所有可用的 functions
$ cat available_filter_functions
# 写入要 trace 的函数
$ echo uptime_proc_show > /sys/kernel/tracing/set_ftrace_filter
# trace function
$ echo function > /sys/kernel/tracing/current_tracer
# 打开 trace
$ echo 1 > /sys/kernel/tracing/tracing_on
# 做一些有写入操作的动作
$ uptime
# 关闭 tracing
$ echo 0 > /sys/kernel/tracing/tracing_on
# 查看 tracing 数据
$ cat /sys/kernel/tracing/trace
# tracer: function
#
# entries-in-buffer/entries-written: 1/1 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / _-=> migrate-disable
# |||| / delay
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
uptime-39274 [005] ..... 46635.054971: uptime_proc_show <-seq_read_iter
# trace 改成 nop
$ echo nop > /sys/kernel/tracing/current_tracer
trace function uptime_proc_show
并显示 stack trace 的例子
根上面的例子一样, 只是在
$ echo uptime_proc_show > /sys/kernel/tracing/set_ftrace_filter
$ echo function > /sys/kernel/tracing/current_tracer
$ echo 1 > /sys/kernel/tracing/options/func_stack_trace
$ echo 1 > /sys/kernel/tracing/tracing_on
$ uptime
$ uptime
$ echo 0 > /sys/kernel/tracing/tracing_on
$ cat /sys/kernel/tracing/trace
# tracer: function
#
# entries-in-buffer/entries-written: 4/4 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / _-=> migrate-disable
# |||| / delay
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
uptime-39283 [005] ..... 47476.243887: uptime_proc_show <-seq_read_iter
uptime-39283 [005] ..... 47476.243892: <stack trace>
=> uptime_proc_show
=> seq_read_iter
=> proc_reg_read_iter
=> new_sync_read
=> vfs_read
=> ksys_read
=> __x64_sys_read
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
uptime-39284 [005] ..... 47480.208731: uptime_proc_show <-seq_read_iter
uptime-39284 [005] ..... 47480.208736: <stack trace>
=> uptime_proc_show
=> seq_read_iter
=> proc_reg_read_iter
=> new_sync_read
=> vfs_read
=> ksys_read
=> __x64_sys_read
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
# 清理
$ echo 0 > /sys/kernel/tracing/options/func_stack_trace
$ echo 0 > /sys/kernel/tracing/tracing_on
$ echo nop > /sys/kernel/tracing/current_tracer
$ echo "" > /sys/kernel/tracing/trace
function_graph tracer 的例子
$ echo uptime_proc_show > set_graph_function
$ echo function_graph > current_tracer
$ echo 1 > tracing_on
$ uptime
$ echo 0 > tracing_on
$ cat trace
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
0) | uptime_proc_show() {
0) | get_idle_time() {
0) | get_cpu_idle_time_us() {
0) 0.230 us | ktime_get();
0) 0.788 us | }
0) 1.231 us | }
0) | get_idle_time() {
0) | get_cpu_idle_time_us() {
0) 0.125 us | ktime_get();
0) 0.177 us | nr_iowait_cpu();
0) 0.742 us | }
0) 0.964 us | }
... 省略 一些 重复 ...
0) 0.197 us | ktime_get_with_offset();
0) 0.107 us | ns_to_timespec64();
0) 0.118 us | set_normalized_timespec64();
0) 1.007 us | seq_printf();
0) + 30.206 us | }
# 清理
$ echo nop > current_tracer
$ echo "" > set_graph_function
function_graph tracer 并过滤一些 function 的例子
$ echo uptime_proc_show > set_graph_function
$ echo get_idle_time > set_graph_notrace
$ echo function_graph > current_tracer
$ echo 1 > tracing_on
$ uptime
$ echo 0 > tracing_on
$ cat trace
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
0) | uptime_proc_show() {
0) 0.371 us | ktime_get_with_offset();
0) 0.152 us | ns_to_timespec64();
0) 0.138 us | set_normalized_timespec64();
0) 1.027 us | seq_printf();
0) 8.900 us | }
# 清理
$ echo nop > current_tracer
$ echo "" > set_graph_function
$ echo "" > set_graph_notrace
trace event 的例子
显示可用的 trace events:
$ cat /sys/kernel/tracing/available_events
$ ls /sys/kernel/tracing/events/
以 events/syscalls/sys_enter_write
为例展示 trace event
$ ls /sys/kernel/tracing/events/syscalls/sys_enter_write
enable filter format hist id inject trigger
# 查看输出格式
$ cat /sys/kernel/tracing/events/syscalls/sys_enter_write/format
# enable event tracing for that event
$ echo 1 > /sys/kernel/tracing/events/syscalls/sys_enter_write/enable
$ echo 1 > /sys/kernel/tracing/tracing_on
# 查看输出
$ cat /sys/kernel/tracing/trace
cat-39448 [005] ..... 49432.357354: sys_write(fd: 1, buf: 7f8e06c19000, count: 62)
cat-39448 [005] ..... 49432.357304: sys_write(fd: 1, buf: 7f8e06c19000, count: 62)
# 清理
$ echo 0 > /sys/kernel/tracing/tracing_on
$ echo 0 > /sys/kernel/tracing/events/syscalls/sys_enter_write/enable
设置 event 过滤条件
过滤 fd 不是 1 的(标准输出 stdout):
$ cat /sys/kernel/tracing/events/syscalls/sys_enter_write/filter
none
$ echo "fd!=1" > /sys/kernel/tracing/events/syscalls/sys_enter_write/filter
$ echo 1 > /sys/kernel/tracing/events/syscalls/sys_enter_write/enable
$ echo 1 > /sys/kernel/tracing/tracing_on
$ cat /sys/kernel/tracing/trace
sudo-3129 [003] ..... 51059.260638: sys_write(fd: 8, buf: 560f1fb345e0, count: fff)
# 清理
$ echo 0 > /sys/kernel/tracing/tracing_on
$ echo 0 > /sys/kernel/tracing/events/syscalls/sys_enter_write/enable
$ echo "" > /sys/kernel/tracing/trace
event 的 histogram
histogram 通过 trigger
实现
# 查看 trigger
cat /sys/kernel/tracing/events/events/syscalls/sys_enter_write/trigger
# Available triggers:
# traceon traceoff snapshot stacktrace enable_event disable_event enable_hist disable_hist hist
$ echo "hist:key=common_pid.execname:val=count:sort=count.descending" > /sys/kernel/tracing/events/syscalls/sys_enter_write/trigger
$ echo "fd!=1" > /sys/kernel/tracing/events/syscalls/sys_enter_write/filter
$ echo 1 > /sys/kernel/tracing/events/syscalls/sys_enter_write/enable
$ echo 1 > /sys/kernel/tracing/tracing_on
$ cat /sys/kernel/tracing/trace
$ cat /sys/kernel/tracing/events/syscalls/sys_enter_write/hist
# event histogram
#
# trigger info: hist:keys=common_pid.execname:vals=hitcount,count:sort=count.descending:size=2048 [active]
#
{ common_pid: sshd [ 1036] } hitcount: 169 count: 13000
{ common_pid: sudo [ 3129] } hitcount: 169 count: 9678
{ common_pid: cat [ 39478] } hitcount: 2 count: 7893
{ common_pid: bash [ 3132] } hitcount: 108 count: 1371
{ common_pid: multipathd [ 497] } hitcount: 122 count: 976
{ common_pid: cat [ 39479] } hitcount: 4 count: 37
{ common_pid: upowerd [ 2616] } hitcount: 1 count: 8
Totals:
Hits: 575
Entries: 7
Dropped: 0
# 清理
$ echo 0 > /sys/kernel/tracing/tracing_on
$ echo 0 > /sys/kernel/tracing/events/syscalls/sys_enter_write/enable
$ echo "" > /sys/kernel/tracing/trace
Ftrace filter
多个 filter
$ echo sys_nanosleep hrtimer_interrupt > set_ftrace_filter
$ cat set_ftrace_filter
hrtimer_interrupt
sys_nanosleep
模糊匹配 模式 使用*. 前面,后面,中间等
$ echo 'hrtimer_*' > set_ftrace_filter
$ cat set_ftrace_filter
添加更多 filter
$ echo sys_nanosleep >> set_ftrace_filter
使用文件 available_filter_functions
里面的索引值
$ head -1 available_filter_functions
$ echo 1 > set_ftrace_filter
$ cat set_ftrace_filter
写文件到文件系统, 并抓取 ksys_write
的 function call:
写一个shell 文件 test.sh
echo $$
sleep 5
echo "" > /sys/kernel/tracing/trace
echo ksys_write > /sys/kernel/tracing/set_graph_function
echo $$ > /sys/kernel/tracing/set_ftrace_pid
echo function_graph > /sys/kernel/tracing/current_tracer
echo 1 > /sys/kernel/tracing/tracing_on
echo $$ > /tmp/test.txt
echo 0 > /sys/kernel/tracing/tracing_on
cat /sys/kernel/tracing/trace > /tmp/result.txt
echo nop > /sys/kernel/tracing/current_tracer
echo "" > /sys/kernel/tracing/trace
echo "done"
cat /tmp/result.txt
其它
以下都是材料, 还没整理:
- 这应该是比较早的关于 ftrace 的邮件: https://lwn.net/Articles/264029/
- 关于 ftrace 的一个配置的说明: https://cateee.net/lkddb/web-lkddb/FTRACE.html
Enable the kernel to trace every kernel function. This is done by
using a compiler feature to insert a small, 5-byte No-Operation
instruction to the beginning of every kernel function, which NOP
sequence is then dynamically patched into a tracer call when tracing
is enabled by the administrator. If it's runtime disabled (the bootup
default), then the overhead of the instructions is very small and not
measurable even in micro-benchmarks.
3.https://alex.dzyoba.com/blog/ftrace/#:~:text=Ftrace%20is%20a%20framework%20for,Tracepoints%20support
关于 ftrace 的一些关键字:
框架, NOP, -pg, gcc, gprof, trace-cmd, kernelShark, tracefs, 文件接口, mcount, 所有 kernel 函数, function, function_graph.
how it works
Ftrace -> mcount -> gpro -> (-pg flags)
https://github.com/freelancer-leon/notes/blob/master/kernel/trace/ftrace-design.md
https://ftp.gnu.org/old-gnu/Manuals/gprof-2.9.1/html_node/gprof_25.html
参考: https://www.kernel.org/doc/html/latest/trace/ftrace.html