1 ftrace - Function Tracer 2 ======================== 3 4Copyright 2008 Red Hat Inc. 5 Author: Steven Rostedt <srostedt@redhat.com> 6 License: The GNU Free Documentation License, Version 1.2 7 (dual licensed under the GPL v2) 8Reviewers: Elias Oltmanns, Randy Dunlap, Andrew Morton, 9 John Kacur, and David Teigland. 10Written for: 2.6.28-rc2 11 12Introduction 13------------ 14 15Ftrace is an internal tracer designed to help out developers and 16designers of systems to find what is going on inside the kernel. 17It can be used for debugging or analyzing latencies and 18performance issues that take place outside of user-space. 19 20Although ftrace is the function tracer, it also includes an 21infrastructure that allows for other types of tracing. Some of 22the tracers that are currently in ftrace include a tracer to 23trace context switches, the time it takes for a high priority 24task to run after it was woken up, the time interrupts are 25disabled, and more (ftrace allows for tracer plugins, which 26means that the list of tracers can always grow). 27 28 29Implementation Details 30---------------------- 31 32See ftrace-design.txt for details for arch porters and such. 33 34 35The File System 36--------------- 37 38Ftrace uses the debugfs file system to hold the control files as 39well as the files to display output. 40 41When debugfs is configured into the kernel (which selecting any ftrace 42option will do) the directory /sys/kernel/debug will be created. To mount 43this directory, you can add to your /etc/fstab file: 44 45 debugfs /sys/kernel/debug debugfs defaults 0 0 46 47Or you can mount it at run time with: 48 49 mount -t debugfs nodev /sys/kernel/debug 50 51For quicker access to that directory you may want to make a soft link to 52it: 53 54 ln -s /sys/kernel/debug /debug 55 56Any selected ftrace option will also create a directory called tracing 57within the debugfs. The rest of the document will assume that you are in 58the ftrace directory (cd /sys/kernel/debug/tracing) and will only concentrate 59on the files within that directory and not distract from the content with 60the extended "/sys/kernel/debug/tracing" path name. 61 62That's it! (assuming that you have ftrace configured into your kernel) 63 64After mounting the debugfs, you can see a directory called 65"tracing". This directory contains the control and output files 66of ftrace. Here is a list of some of the key files: 67 68 69 Note: all time values are in microseconds. 70 71 current_tracer: 72 73 This is used to set or display the current tracer 74 that is configured. 75 76 available_tracers: 77 78 This holds the different types of tracers that 79 have been compiled into the kernel. The 80 tracers listed here can be configured by 81 echoing their name into current_tracer. 82 83 tracing_on: 84 85 This sets or displays whether writing to the trace 86 ring buffer is enabled. Echo 0 into this file to disable 87 the tracer or 1 to enable it. 88 89 trace: 90 91 This file holds the output of the trace in a human 92 readable format (described below). 93 94 trace_pipe: 95 96 The output is the same as the "trace" file but this 97 file is meant to be streamed with live tracing. 98 Reads from this file will block until new data is 99 retrieved. Unlike the "trace" file, this file is a 100 consumer. This means reading from this file causes 101 sequential reads to display more current data. Once 102 data is read from this file, it is consumed, and 103 will not be read again with a sequential read. The 104 "trace" file is static, and if the tracer is not 105 adding more data,they will display the same 106 information every time they are read. 107 108 trace_options: 109 110 This file lets the user control the amount of data 111 that is displayed in one of the above output 112 files. 113 114 tracing_max_latency: 115 116 Some of the tracers record the max latency. 117 For example, the time interrupts are disabled. 118 This time is saved in this file. The max trace 119 will also be stored, and displayed by "trace". 120 A new max trace will only be recorded if the 121 latency is greater than the value in this 122 file. (in microseconds) 123 124 buffer_size_kb: 125 126 This sets or displays the number of kilobytes each CPU 127 buffer can hold. The tracer buffers are the same size 128 for each CPU. The displayed number is the size of the 129 CPU buffer and not total size of all buffers. The 130 trace buffers are allocated in pages (blocks of memory 131 that the kernel uses for allocation, usually 4 KB in size). 132 If the last page allocated has room for more bytes 133 than requested, the rest of the page will be used, 134 making the actual allocation bigger than requested. 135 ( Note, the size may not be a multiple of the page size 136 due to buffer management overhead. ) 137 138 This can only be updated when the current_tracer 139 is set to "nop". 140 141 tracing_cpumask: 142 143 This is a mask that lets the user only trace 144 on specified CPUS. The format is a hex string 145 representing the CPUS. 146 147 set_ftrace_filter: 148 149 When dynamic ftrace is configured in (see the 150 section below "dynamic ftrace"), the code is dynamically 151 modified (code text rewrite) to disable calling of the 152 function profiler (mcount). This lets tracing be configured 153 in with practically no overhead in performance. This also 154 has a side effect of enabling or disabling specific functions 155 to be traced. Echoing names of functions into this file 156 will limit the trace to only those functions. 157 158 This interface also allows for commands to be used. See the 159 "Filter commands" section for more details. 160 161 set_ftrace_notrace: 162 163 This has an effect opposite to that of 164 set_ftrace_filter. Any function that is added here will not 165 be traced. If a function exists in both set_ftrace_filter 166 and set_ftrace_notrace, the function will _not_ be traced. 167 168 set_ftrace_pid: 169 170 Have the function tracer only trace a single thread. 171 172 set_graph_function: 173 174 Set a "trigger" function where tracing should start 175 with the function graph tracer (See the section 176 "dynamic ftrace" for more details). 177 178 available_filter_functions: 179 180 This lists the functions that ftrace 181 has processed and can trace. These are the function 182 names that you can pass to "set_ftrace_filter" or 183 "set_ftrace_notrace". (See the section "dynamic ftrace" 184 below for more details.) 185 186 187The Tracers 188----------- 189 190Here is the list of current tracers that may be configured. 191 192 "function" 193 194 Function call tracer to trace all kernel functions. 195 196 "function_graph" 197 198 Similar to the function tracer except that the 199 function tracer probes the functions on their entry 200 whereas the function graph tracer traces on both entry 201 and exit of the functions. It then provides the ability 202 to draw a graph of function calls similar to C code 203 source. 204 205 "irqsoff" 206 207 Traces the areas that disable interrupts and saves 208 the trace with the longest max latency. 209 See tracing_max_latency. When a new max is recorded, 210 it replaces the old trace. It is best to view this 211 trace with the latency-format option enabled. 212 213 "preemptoff" 214 215 Similar to irqsoff but traces and records the amount of 216 time for which preemption is disabled. 217 218 "preemptirqsoff" 219 220 Similar to irqsoff and preemptoff, but traces and 221 records the largest time for which irqs and/or preemption 222 is disabled. 223 224 "wakeup" 225 226 Traces and records the max latency that it takes for 227 the highest priority task to get scheduled after 228 it has been woken up. 229 230 "hw-branch-tracer" 231 232 Uses the BTS CPU feature on x86 CPUs to traces all 233 branches executed. 234 235 "nop" 236 237 This is the "trace nothing" tracer. To remove all 238 tracers from tracing simply echo "nop" into 239 current_tracer. 240 241 242Examples of using the tracer 243---------------------------- 244 245Here are typical examples of using the tracers when controlling 246them only with the debugfs interface (without using any 247user-land utilities). 248 249Output format: 250-------------- 251 252Here is an example of the output format of the file "trace" 253 254 -------- 255# tracer: function 256# 257# TASK-PID CPU# TIMESTAMP FUNCTION 258# | | | | | 259 bash-4251 [01] 10152.583854: path_put <-path_walk 260 bash-4251 [01] 10152.583855: dput <-path_put 261 bash-4251 [01] 10152.583855: _atomic_dec_and_lock <-dput 262 -------- 263 264A header is printed with the tracer name that is represented by 265the trace. In this case the tracer is "function". Then a header 266showing the format. Task name "bash", the task PID "4251", the 267CPU that it was running on "01", the timestamp in <secs>.<usecs> 268format, the function name that was traced "path_put" and the 269parent function that called this function "path_walk". The 270timestamp is the time at which the function was entered. 271 272Latency trace format 273-------------------- 274 275When the latency-format option is enabled, the trace file gives 276somewhat more information to see why a latency happened. 277Here is a typical trace. 278 279# tracer: irqsoff 280# 281irqsoff latency trace v1.1.5 on 2.6.26-rc8 282-------------------------------------------------------------------- 283 latency: 97 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) 284 ----------------- 285 | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0) 286 ----------------- 287 => started at: apic_timer_interrupt 288 => ended at: do_softirq 289 290# _------=> CPU# 291# / _-----=> irqs-off 292# | / _----=> need-resched 293# || / _---=> hardirq/softirq 294# ||| / _--=> preempt-depth 295# |||| / 296# ||||| delay 297# cmd pid ||||| time | caller 298# \ / ||||| \ | / 299 <idle>-0 0d..1 0us+: trace_hardirqs_off_thunk (apic_timer_interrupt) 300 <idle>-0 0d.s. 97us : __do_softirq (do_softirq) 301 <idle>-0 0d.s1 98us : trace_hardirqs_on (do_softirq) 302 303 304This shows that the current tracer is "irqsoff" tracing the time 305for which interrupts were disabled. It gives the trace version 306and the version of the kernel upon which this was executed on 307(2.6.26-rc8). Then it displays the max latency in microsecs (97 308us). The number of trace entries displayed and the total number 309recorded (both are three: #3/3). The type of preemption that was 310used (PREEMPT). VP, KP, SP, and HP are always zero and are 311reserved for later use. #P is the number of online CPUS (#P:2). 312 313The task is the process that was running when the latency 314occurred. (swapper pid: 0). 315 316The start and stop (the functions in which the interrupts were 317disabled and enabled respectively) that caused the latencies: 318 319 apic_timer_interrupt is where the interrupts were disabled. 320 do_softirq is where they were enabled again. 321 322The next lines after the header are the trace itself. The header 323explains which is which. 324 325 cmd: The name of the process in the trace. 326 327 pid: The PID of that process. 328 329 CPU#: The CPU which the process was running on. 330 331 irqs-off: 'd' interrupts are disabled. '.' otherwise. 332 Note: If the architecture does not support a way to 333 read the irq flags variable, an 'X' will always 334 be printed here. 335 336 need-resched: 'N' task need_resched is set, '.' otherwise. 337 338 hardirq/softirq: 339 'H' - hard irq occurred inside a softirq. 340 'h' - hard irq is running 341 's' - soft irq is running 342 '.' - normal context. 343 344 preempt-depth: The level of preempt_disabled 345 346The above is mostly meaningful for kernel developers. 347 348 time: When the latency-format option is enabled, the trace file 349 output includes a timestamp relative to the start of the 350 trace. This differs from the output when latency-format 351 is disabled, which includes an absolute timestamp. 352 353 delay: This is just to help catch your eye a bit better. And 354 needs to be fixed to be only relative to the same CPU. 355 The marks are determined by the difference between this 356 current trace and the next trace. 357 '!' - greater than preempt_mark_thresh (default 100) 358 '+' - greater than 1 microsecond 359 ' ' - less than or equal to 1 microsecond. 360 361 The rest is the same as the 'trace' file. 362 363 364trace_options 365------------- 366 367The trace_options file is used to control what gets printed in 368the trace output. To see what is available, simply cat the file: 369 370 cat trace_options 371 print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \ 372 noblock nostacktrace nosched-tree nouserstacktrace nosym-userobj 373 374To disable one of the options, echo in the option prepended with 375"no". 376 377 echo noprint-parent > trace_options 378 379To enable an option, leave off the "no". 380 381 echo sym-offset > trace_options 382 383Here are the available options: 384 385 print-parent - On function traces, display the calling (parent) 386 function as well as the function being traced. 387 388 print-parent: 389 bash-4000 [01] 1477.606694: simple_strtoul <-strict_strtoul 390 391 noprint-parent: 392 bash-4000 [01] 1477.606694: simple_strtoul 393 394 395 sym-offset - Display not only the function name, but also the 396 offset in the function. For example, instead of 397 seeing just "ktime_get", you will see 398 "ktime_get+0xb/0x20". 399 400 sym-offset: 401 bash-4000 [01] 1477.606694: simple_strtoul+0x6/0xa0 402 403 sym-addr - this will also display the function address as well 404 as the function name. 405 406 sym-addr: 407 bash-4000 [01] 1477.606694: simple_strtoul <c0339346> 408 409 verbose - This deals with the trace file when the 410 latency-format option is enabled. 411 412 bash 4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \ 413 (+0.000ms): simple_strtoul (strict_strtoul) 414 415 raw - This will display raw numbers. This option is best for 416 use with user applications that can translate the raw 417 numbers better than having it done in the kernel. 418 419 hex - Similar to raw, but the numbers will be in a hexadecimal 420 format. 421 422 bin - This will print out the formats in raw binary. 423 424 block - TBD (needs update) 425 426 stacktrace - This is one of the options that changes the trace 427 itself. When a trace is recorded, so is the stack 428 of functions. This allows for back traces of 429 trace sites. 430 431 userstacktrace - This option changes the trace. It records a 432 stacktrace of the current userspace thread. 433 434 sym-userobj - when user stacktrace are enabled, look up which 435 object the address belongs to, and print a 436 relative address. This is especially useful when 437 ASLR is on, otherwise you don't get a chance to 438 resolve the address to object/file/line after 439 the app is no longer running 440 441 The lookup is performed when you read 442 trace,trace_pipe. Example: 443 444 a.out-1623 [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0 445x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6] 446 447 sched-tree - trace all tasks that are on the runqueue, at 448 every scheduling event. Will add overhead if 449 there's a lot of tasks running at once. 450 451 latency-format - This option changes the trace. When 452 it is enabled, the trace displays 453 additional information about the 454 latencies, as described in "Latency 455 trace format". 456 457 overwrite - This controls what happens when the trace buffer is 458 full. If "1" (default), the oldest events are 459 discarded and overwritten. If "0", then the newest 460 events are discarded. 461 462ftrace_enabled 463-------------- 464 465The following tracers (listed below) give different output 466depending on whether or not the sysctl ftrace_enabled is set. To 467set ftrace_enabled, one can either use the sysctl function or 468set it via the proc file system interface. 469 470 sysctl kernel.ftrace_enabled=1 471 472 or 473 474 echo 1 > /proc/sys/kernel/ftrace_enabled 475 476To disable ftrace_enabled simply replace the '1' with '0' in the 477above commands. 478 479When ftrace_enabled is set the tracers will also record the 480functions that are within the trace. The descriptions of the 481tracers will also show an example with ftrace enabled. 482 483 484irqsoff 485------- 486 487When interrupts are disabled, the CPU can not react to any other 488external event (besides NMIs and SMIs). This prevents the timer 489interrupt from triggering or the mouse interrupt from letting 490the kernel know of a new mouse event. The result is a latency 491with the reaction time. 492 493The irqsoff tracer tracks the time for which interrupts are 494disabled. When a new maximum latency is hit, the tracer saves 495the trace leading up to that latency point so that every time a 496new maximum is reached, the old saved trace is discarded and the 497new trace is saved. 498 499To reset the maximum, echo 0 into tracing_max_latency. Here is 500an example: 501 502 # echo irqsoff > current_tracer 503 # echo latency-format > trace_options 504 # echo 0 > tracing_max_latency 505 # echo 1 > tracing_on 506 # ls -ltr 507 [...] 508 # echo 0 > tracing_on 509 # cat trace 510# tracer: irqsoff 511# 512irqsoff latency trace v1.1.5 on 2.6.26 513-------------------------------------------------------------------- 514 latency: 12 us, #3/3, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) 515 ----------------- 516 | task: bash-3730 (uid:0 nice:0 policy:0 rt_prio:0) 517 ----------------- 518 => started at: sys_setpgid 519 => ended at: sys_setpgid 520 521# _------=> CPU# 522# / _-----=> irqs-off 523# | / _----=> need-resched 524# || / _---=> hardirq/softirq 525# ||| / _--=> preempt-depth 526# |||| / 527# ||||| delay 528# cmd pid ||||| time | caller 529# \ / ||||| \ | / 530 bash-3730 1d... 0us : _write_lock_irq (sys_setpgid) 531 bash-3730 1d..1 1us+: _write_unlock_irq (sys_setpgid) 532 bash-3730 1d..2 14us : trace_hardirqs_on (sys_setpgid) 533 534 535Here we see that that we had a latency of 12 microsecs (which is 536very good). The _write_lock_irq in sys_setpgid disabled 537interrupts. The difference between the 12 and the displayed 538timestamp 14us occurred because the clock was incremented 539between the time of recording the max latency and the time of 540recording the function that had that latency. 541 542Note the above example had ftrace_enabled not set. If we set the 543ftrace_enabled, we get a much larger output: 544 545# tracer: irqsoff 546# 547irqsoff latency trace v1.1.5 on 2.6.26-rc8 548-------------------------------------------------------------------- 549 latency: 50 us, #101/101, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) 550 ----------------- 551 | task: ls-4339 (uid:0 nice:0 policy:0 rt_prio:0) 552 ----------------- 553 => started at: __alloc_pages_internal 554 => ended at: __alloc_pages_internal 555 556# _------=> CPU# 557# / _-----=> irqs-off 558# | / _----=> need-resched 559# || / _---=> hardirq/softirq 560# ||| / _--=> preempt-depth 561# |||| / 562# ||||| delay 563# cmd pid ||||| time | caller 564# \ / ||||| \ | / 565 ls-4339 0...1 0us+: get_page_from_freelist (__alloc_pages_internal) 566 ls-4339 0d..1 3us : rmqueue_bulk (get_page_from_freelist) 567 ls-4339 0d..1 3us : _spin_lock (rmqueue_bulk) 568 ls-4339 0d..1 4us : add_preempt_count (_spin_lock) 569 ls-4339 0d..2 4us : __rmqueue (rmqueue_bulk) 570 ls-4339 0d..2 5us : __rmqueue_smallest (__rmqueue) 571 ls-4339 0d..2 5us : __mod_zone_page_state (__rmqueue_smallest) 572 ls-4339 0d..2 6us : __rmqueue (rmqueue_bulk) 573 ls-4339 0d..2 6us : __rmqueue_smallest (__rmqueue) 574 ls-4339 0d..2 7us : __mod_zone_page_state (__rmqueue_smallest) 575 ls-4339 0d..2 7us : __rmqueue (rmqueue_bulk) 576 ls-4339 0d..2 8us : __rmqueue_smallest (__rmqueue) 577[...] 578 ls-4339 0d..2 46us : __rmqueue_smallest (__rmqueue) 579 ls-4339 0d..2 47us : __mod_zone_page_state (__rmqueue_smallest) 580 ls-4339 0d..2 47us : __rmqueue (rmqueue_bulk) 581 ls-4339 0d..2 48us : __rmqueue_smallest (__rmqueue) 582 ls-4339 0d..2 48us : __mod_zone_page_state (__rmqueue_smallest) 583 ls-4339 0d..2 49us : _spin_unlock (rmqueue_bulk) 584 ls-4339 0d..2 49us : sub_preempt_count (_spin_unlock) 585 ls-4339 0d..1 50us : get_page_from_freelist (__alloc_pages_internal) 586 ls-4339 0d..2 51us : trace_hardirqs_on (__alloc_pages_internal) 587 588 589 590Here we traced a 50 microsecond latency. But we also see all the 591functions that were called during that time. Note that by 592enabling function tracing, we incur an added overhead. This 593overhead may extend the latency times. But nevertheless, this 594trace has provided some very helpful debugging information. 595 596 597preemptoff 598---------- 599 600When preemption is disabled, we may be able to receive 601interrupts but the task cannot be preempted and a higher 602priority task must wait for preemption to be enabled again 603before it can preempt a lower priority task. 604 605The preemptoff tracer traces the places that disable preemption. 606Like the irqsoff tracer, it records the maximum latency for 607which preemption was disabled. The control of preemptoff tracer 608is much like the irqsoff tracer. 609 610 # echo preemptoff > current_tracer 611 # echo latency-format > trace_options 612 # echo 0 > tracing_max_latency 613 # echo 1 > tracing_on 614 # ls -ltr 615 [...] 616 # echo 0 > tracing_on 617 # cat trace 618# tracer: preemptoff 619# 620preemptoff latency trace v1.1.5 on 2.6.26-rc8 621-------------------------------------------------------------------- 622 latency: 29 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) 623 ----------------- 624 | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0) 625 ----------------- 626 => started at: do_IRQ 627 => ended at: __do_softirq 628 629# _------=> CPU# 630# / _-----=> irqs-off 631# | / _----=> need-resched 632# || / _---=> hardirq/softirq 633# ||| / _--=> preempt-depth 634# |||| / 635# ||||| delay 636# cmd pid ||||| time | caller 637# \ / ||||| \ | / 638 sshd-4261 0d.h. 0us+: irq_enter (do_IRQ) 639 sshd-4261 0d.s. 29us : _local_bh_enable (__do_softirq) 640 sshd-4261 0d.s1 30us : trace_preempt_on (__do_softirq) 641 642 643This has some more changes. Preemption was disabled when an 644interrupt came in (notice the 'h'), and was enabled while doing 645a softirq. (notice the 's'). But we also see that interrupts 646have been disabled when entering the preempt off section and 647leaving it (the 'd'). We do not know if interrupts were enabled 648in the mean time. 649 650# tracer: preemptoff 651# 652preemptoff latency trace v1.1.5 on 2.6.26-rc8 653-------------------------------------------------------------------- 654 latency: 63 us, #87/87, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) 655 ----------------- 656 | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0) 657 ----------------- 658 => started at: remove_wait_queue 659 => ended at: __do_softirq 660 661# _------=> CPU# 662# / _-----=> irqs-off 663# | / _----=> need-resched 664# || / _---=> hardirq/softirq 665# ||| / _--=> preempt-depth 666# |||| / 667# ||||| delay 668# cmd pid ||||| time | caller 669# \ / ||||| \ | / 670 sshd-4261 0d..1 0us : _spin_lock_irqsave (remove_wait_queue) 671 sshd-4261 0d..1 1us : _spin_unlock_irqrestore (remove_wait_queue) 672 sshd-4261 0d..1 2us : do_IRQ (common_interrupt) 673 sshd-4261 0d..1 2us : irq_enter (do_IRQ) 674 sshd-4261 0d..1 2us : idle_cpu (irq_enter) 675 sshd-4261 0d..1 3us : add_preempt_count (irq_enter) 676 sshd-4261 0d.h1 3us : idle_cpu (irq_enter) 677 sshd-4261 0d.h. 4us : handle_fasteoi_irq (do_IRQ) 678[...] 679 sshd-4261 0d.h. 12us : add_preempt_count (_spin_lock) 680 sshd-4261 0d.h1 12us : ack_ioapic_quirk_irq (handle_fasteoi_irq) 681 sshd-4261 0d.h1 13us : move_native_irq (ack_ioapic_quirk_irq) 682 sshd-4261 0d.h1 13us : _spin_unlock (handle_fasteoi_irq) 683 sshd-4261 0d.h1 14us : sub_preempt_count (_spin_unlock) 684 sshd-4261 0d.h1 14us : irq_exit (do_IRQ) 685 sshd-4261 0d.h1 15us : sub_preempt_count (irq_exit) 686 sshd-4261 0d..2 15us : do_softirq (irq_exit) 687 sshd-4261 0d... 15us : __do_softirq (do_softirq) 688 sshd-4261 0d... 16us : __local_bh_disable (__do_softirq) 689 sshd-4261 0d... 16us+: add_preempt_count (__local_bh_disable) 690 sshd-4261 0d.s4 20us : add_preempt_count (__local_bh_disable) 691 sshd-4261 0d.s4 21us : sub_preempt_count (local_bh_enable) 692 sshd-4261 0d.s5 21us : sub_preempt_count (local_bh_enable) 693[...] 694 sshd-4261 0d.s6 41us : add_preempt_count (__local_bh_disable) 695 sshd-4261 0d.s6 42us : sub_preempt_count (local_bh_enable) 696 sshd-4261 0d.s7 42us : sub_preempt_count (local_bh_enable) 697 sshd-4261 0d.s5 43us : add_preempt_count (__local_bh_disable) 698 sshd-4261 0d.s5 43us : sub_preempt_count (local_bh_enable_ip) 699 sshd-4261 0d.s6 44us : sub_preempt_count (local_bh_enable_ip) 700 sshd-4261 0d.s5 44us : add_preempt_count (__local_bh_disable) 701 sshd-4261 0d.s5 45us : sub_preempt_count (local_bh_enable) 702[...] 703 sshd-4261 0d.s. 63us : _local_bh_enable (__do_softirq) 704 sshd-4261 0d.s1 64us : trace_preempt_on (__do_softirq) 705 706 707The above is an example of the preemptoff trace with 708ftrace_enabled set. Here we see that interrupts were disabled 709the entire time. The irq_enter code lets us know that we entered 710an interrupt 'h'. Before that, the functions being traced still 711show that it is not in an interrupt, but we can see from the 712functions themselves that this is not the case. 713 714Notice that __do_softirq when called does not have a 715preempt_count. It may seem that we missed a preempt enabling. 716What really happened is that the preempt count is held on the 717thread's stack and we switched to the softirq stack (4K stacks 718in effect). The code does not copy the preempt count, but 719because interrupts are disabled, we do not need to worry about 720it. Having a tracer like this is good for letting people know 721what really happens inside the kernel. 722 723 724preemptirqsoff 725-------------- 726 727Knowing the locations that have interrupts disabled or 728preemption disabled for the longest times is helpful. But 729sometimes we would like to know when either preemption and/or 730interrupts are disabled. 731 732Consider the following code: 733 734 local_irq_disable(); 735 call_function_with_irqs_off(); 736 preempt_disable(); 737 call_function_with_irqs_and_preemption_off(); 738 local_irq_enable(); 739 call_function_with_preemption_off(); 740 preempt_enable(); 741 742The irqsoff tracer will record the total length of 743call_function_with_irqs_off() and 744call_function_with_irqs_and_preemption_off(). 745 746The preemptoff tracer will record the total length of 747call_function_with_irqs_and_preemption_off() and 748call_function_with_preemption_off(). 749 750But neither will trace the time that interrupts and/or 751preemption is disabled. This total time is the time that we can 752not schedule. To record this time, use the preemptirqsoff 753tracer. 754 755Again, using this trace is much like the irqsoff and preemptoff 756tracers. 757 758 # echo preemptirqsoff > current_tracer 759 # echo latency-format > trace_options 760 # echo 0 > tracing_max_latency 761 # echo 1 > tracing_on 762 # ls -ltr 763 [...] 764 # echo 0 > tracing_on 765 # cat trace 766# tracer: preemptirqsoff 767# 768preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8 769-------------------------------------------------------------------- 770 latency: 293 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) 771 ----------------- 772 | task: ls-4860 (uid:0 nice:0 policy:0 rt_prio:0) 773 ----------------- 774 => started at: apic_timer_interrupt 775 => ended at: __do_softirq 776 777# _------=> CPU# 778# / _-----=> irqs-off 779# | / _----=> need-resched 780# || / _---=> hardirq/softirq 781# ||| / _--=> preempt-depth 782# |||| / 783# ||||| delay 784# cmd pid ||||| time | caller 785# \ / ||||| \ | / 786 ls-4860 0d... 0us!: trace_hardirqs_off_thunk (apic_timer_interrupt) 787 ls-4860 0d.s. 294us : _local_bh_enable (__do_softirq) 788 ls-4860 0d.s1 294us : trace_preempt_on (__do_softirq) 789 790 791 792The trace_hardirqs_off_thunk is called from assembly on x86 when 793interrupts are disabled in the assembly code. Without the 794function tracing, we do not know if interrupts were enabled 795within the preemption points. We do see that it started with 796preemption enabled. 797 798Here is a trace with ftrace_enabled set: 799 800 801# tracer: preemptirqsoff 802# 803preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8 804-------------------------------------------------------------------- 805 latency: 105 us, #183/183, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) 806 ----------------- 807 | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0) 808 ----------------- 809 => started at: write_chan 810 => ended at: __do_softirq 811 812# _------=> CPU# 813# / _-----=> irqs-off 814# | / _----=> need-resched 815# || / _---=> hardirq/softirq 816# ||| / _--=> preempt-depth 817# |||| / 818# ||||| delay 819# cmd pid ||||| time | caller 820# \ / ||||| \ | / 821 ls-4473 0.N.. 0us : preempt_schedule (write_chan) 822 ls-4473 0dN.1 1us : _spin_lock (schedule) 823 ls-4473 0dN.1 2us : add_preempt_count (_spin_lock) 824 ls-4473 0d..2 2us : put_prev_task_fair (schedule) 825[...] 826 ls-4473 0d..2 13us : set_normalized_timespec (ktime_get_ts) 827 ls-4473 0d..2 13us : __switch_to (schedule) 828 sshd-4261 0d..2 14us : finish_task_switch (schedule) 829 sshd-4261 0d..2 14us : _spin_unlock_irq (finish_task_switch) 830 sshd-4261 0d..1 15us : add_preempt_count (_spin_lock_irqsave) 831 sshd-4261 0d..2 16us : _spin_unlock_irqrestore (hrtick_set) 832 sshd-4261 0d..2 16us : do_IRQ (common_interrupt) 833 sshd-4261 0d..2 17us : irq_enter (do_IRQ) 834 sshd-4261 0d..2 17us : idle_cpu (irq_enter) 835 sshd-4261 0d..2 18us : add_preempt_count (irq_enter) 836 sshd-4261 0d.h2 18us : idle_cpu (irq_enter) 837 sshd-4261 0d.h. 18us : handle_fasteoi_irq (do_IRQ) 838 sshd-4261 0d.h. 19us : _spin_lock (handle_fasteoi_irq) 839 sshd-4261 0d.h. 19us : add_preempt_count (_spin_lock) 840 sshd-4261 0d.h1 20us : _spin_unlock (handle_fasteoi_irq) 841 sshd-4261 0d.h1 20us : sub_preempt_count (_spin_unlock) 842[...] 843 sshd-4261 0d.h1 28us : _spin_unlock (handle_fasteoi_irq) 844 sshd-4261 0d.h1 29us : sub_preempt_count (_spin_unlock) 845 sshd-4261 0d.h2 29us : irq_exit (do_IRQ) 846 sshd-4261 0d.h2 29us : sub_preempt_count (irq_exit) 847 sshd-4261 0d..3 30us : do_softirq (irq_exit) 848 sshd-4261 0d... 30us : __do_softirq (do_softirq) 849 sshd-4261 0d... 31us : __local_bh_disable (__do_softirq) 850 sshd-4261 0d... 31us+: add_preempt_count (__local_bh_disable) 851 sshd-4261 0d.s4 34us : add_preempt_count (__local_bh_disable) 852[...] 853 sshd-4261 0d.s3 43us : sub_preempt_count (local_bh_enable_ip) 854 sshd-4261 0d.s4 44us : sub_preempt_count (local_bh_enable_ip) 855 sshd-4261 0d.s3 44us : smp_apic_timer_interrupt (apic_timer_interrupt) 856 sshd-4261 0d.s3 45us : irq_enter (smp_apic_timer_interrupt) 857 sshd-4261 0d.s3 45us : idle_cpu (irq_enter) 858 sshd-4261 0d.s3 46us : add_preempt_count (irq_enter) 859 sshd-4261 0d.H3 46us : idle_cpu (irq_enter) 860 sshd-4261 0d.H3 47us : hrtimer_interrupt (smp_apic_timer_interrupt) 861 sshd-4261 0d.H3 47us : ktime_get (hrtimer_interrupt) 862[...] 863 sshd-4261 0d.H3 81us : tick_program_event (hrtimer_interrupt) 864 sshd-4261 0d.H3 82us : ktime_get (tick_program_event) 865 sshd-4261 0d.H3 82us : ktime_get_ts (ktime_get) 866 sshd-4261 0d.H3 83us : getnstimeofday (ktime_get_ts) 867 sshd-4261 0d.H3 83us : set_normalized_timespec (ktime_get_ts) 868 sshd-4261 0d.H3 84us : clockevents_program_event (tick_program_event) 869 sshd-4261 0d.H3 84us : lapic_next_event (clockevents_program_event) 870 sshd-4261 0d.H3 85us : irq_exit (smp_apic_timer_interrupt) 871 sshd-4261 0d.H3 85us : sub_preempt_count (irq_exit) 872 sshd-4261 0d.s4 86us : sub_preempt_count (irq_exit) 873 sshd-4261 0d.s3 86us : add_preempt_count (__local_bh_disable) 874[...] 875 sshd-4261 0d.s1 98us : sub_preempt_count (net_rx_action) 876 sshd-4261 0d.s. 99us : add_preempt_count (_spin_lock_irq) 877 sshd-4261 0d.s1 99us+: _spin_unlock_irq (run_timer_softirq) 878 sshd-4261 0d.s. 104us : _local_bh_enable (__do_softirq) 879 sshd-4261 0d.s. 104us : sub_preempt_count (_local_bh_enable) 880 sshd-4261 0d.s. 105us : _local_bh_enable (__do_softirq) 881 sshd-4261 0d.s1 105us : trace_preempt_on (__do_softirq) 882 883 884This is a very interesting trace. It started with the preemption 885of the ls task. We see that the task had the "need_resched" bit 886set via the 'N' in the trace. Interrupts were disabled before 887the spin_lock at the beginning of the trace. We see that a 888schedule took place to run sshd. When the interrupts were 889enabled, we took an interrupt. On return from the interrupt 890handler, the softirq ran. We took another interrupt while 891running the softirq as we see from the capital 'H'. 892 893 894wakeup 895------ 896 897In a Real-Time environment it is very important to know the 898wakeup time it takes for the highest priority task that is woken 899up to the time that it executes. This is also known as "schedule 900latency". I stress the point that this is about RT tasks. It is 901also important to know the scheduling latency of non-RT tasks, 902but the average schedule latency is better for non-RT tasks. 903Tools like LatencyTop are more appropriate for such 904measurements. 905 906Real-Time environments are interested in the worst case latency. 907That is the longest latency it takes for something to happen, 908and not the average. We can have a very fast scheduler that may 909only have a large latency once in a while, but that would not 910work well with Real-Time tasks. The wakeup tracer was designed 911to record the worst case wakeups of RT tasks. Non-RT tasks are 912not recorded because the tracer only records one worst case and 913tracing non-RT tasks that are unpredictable will overwrite the 914worst case latency of RT tasks. 915 916Since this tracer only deals with RT tasks, we will run this 917slightly differently than we did with the previous tracers. 918Instead of performing an 'ls', we will run 'sleep 1' under 919'chrt' which changes the priority of the task. 920 921 # echo wakeup > current_tracer 922 # echo latency-format > trace_options 923 # echo 0 > tracing_max_latency 924 # echo 1 > tracing_on 925 # chrt -f 5 sleep 1 926 # echo 0 > tracing_on 927 # cat trace 928# tracer: wakeup 929# 930wakeup latency trace v1.1.5 on 2.6.26-rc8 931-------------------------------------------------------------------- 932 latency: 4 us, #2/2, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) 933 ----------------- 934 | task: sleep-4901 (uid:0 nice:0 policy:1 rt_prio:5) 935 ----------------- 936 937# _------=> CPU# 938# / _-----=> irqs-off 939# | / _----=> need-resched 940# || / _---=> hardirq/softirq 941# ||| / _--=> preempt-depth 942# |||| / 943# ||||| delay 944# cmd pid ||||| time | caller 945# \ / ||||| \ | / 946 <idle>-0 1d.h4 0us+: try_to_wake_up (wake_up_process) 947 <idle>-0 1d..4 4us : schedule (cpu_idle) 948 949 950Running this on an idle system, we see that it only took 4 951microseconds to perform the task switch. Note, since the trace 952marker in the schedule is before the actual "switch", we stop 953the tracing when the recorded task is about to schedule in. This 954may change if we add a new marker at the end of the scheduler. 955 956Notice that the recorded task is 'sleep' with the PID of 4901 957and it has an rt_prio of 5. This priority is user-space priority 958and not the internal kernel priority. The policy is 1 for 959SCHED_FIFO and 2 for SCHED_RR. 960 961Doing the same with chrt -r 5 and ftrace_enabled set. 962 963# tracer: wakeup 964# 965wakeup latency trace v1.1.5 on 2.6.26-rc8 966-------------------------------------------------------------------- 967 latency: 50 us, #60/60, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) 968 ----------------- 969 | task: sleep-4068 (uid:0 nice:0 policy:2 rt_prio:5) 970 ----------------- 971 972# _------=> CPU# 973# / _-----=> irqs-off 974# | / _----=> need-resched 975# || / _---=> hardirq/softirq 976# ||| / _--=> preempt-depth 977# |||| / 978# ||||| delay 979# cmd pid ||||| time | caller 980# \ / ||||| \ | / 981ksoftirq-7 1d.H3 0us : try_to_wake_up (wake_up_process) 982ksoftirq-7 1d.H4 1us : sub_preempt_count (marker_probe_cb) 983ksoftirq-7 1d.H3 2us : check_preempt_wakeup (try_to_wake_up) 984ksoftirq-7 1d.H3 3us : update_curr (check_preempt_wakeup) 985ksoftirq-7 1d.H3 4us : calc_delta_mine (update_curr) 986ksoftirq-7 1d.H3 5us : __resched_task (check_preempt_wakeup) 987ksoftirq-7 1d.H3 6us : task_wake_up_rt (try_to_wake_up) 988ksoftirq-7 1d.H3 7us : _spin_unlock_irqrestore (try_to_wake_up) 989[...] 990ksoftirq-7 1d.H2 17us : irq_exit (smp_apic_timer_interrupt) 991ksoftirq-7 1d.H2 18us : sub_preempt_count (irq_exit) 992ksoftirq-7 1d.s3 19us : sub_preempt_count (irq_exit) 993ksoftirq-7 1..s2 20us : rcu_process_callbacks (__do_softirq) 994[...] 995ksoftirq-7 1..s2 26us : __rcu_process_callbacks (rcu_process_callbacks) 996ksoftirq-7 1d.s2 27us : _local_bh_enable (__do_softirq) 997ksoftirq-7 1d.s2 28us : sub_preempt_count (_local_bh_enable) 998ksoftirq-7 1.N.3 29us : sub_preempt_count (ksoftirqd) 999ksoftirq-7 1.N.2 30us : _cond_resched (ksoftirqd) 1000ksoftirq-7 1.N.2 31us : __cond_resched (_cond_resched) 1001ksoftirq-7 1.N.2 32us : add_preempt_count (__cond_resched) 1002ksoftirq-7 1.N.2 33us : schedule (__cond_resched) 1003ksoftirq-7 1.N.2 33us : add_preempt_count (schedule) 1004ksoftirq-7 1.N.3 34us : hrtick_clear (schedule) 1005ksoftirq-7 1dN.3 35us : _spin_lock (schedule) 1006ksoftirq-7 1dN.3 36us : add_preempt_count (_spin_lock) 1007ksoftirq-7 1d..4 37us : put_prev_task_fair (schedule) 1008ksoftirq-7 1d..4 38us : update_curr (put_prev_task_fair) 1009[...] 1010ksoftirq-7 1d..5 47us : _spin_trylock (tracing_record_cmdline) 1011ksoftirq-7 1d..5 48us : add_preempt_count (_spin_trylock) 1012ksoftirq-7 1d..6 49us : _spin_unlock (tracing_record_cmdline) 1013ksoftirq-7 1d..6 49us : sub_preempt_count (_spin_unlock) 1014ksoftirq-7 1d..4 50us : schedule (__cond_resched) 1015 1016The interrupt went off while running ksoftirqd. This task runs 1017at SCHED_OTHER. Why did not we see the 'N' set early? This may 1018be a harmless bug with x86_32 and 4K stacks. On x86_32 with 4K 1019stacks configured, the interrupt and softirq run with their own 1020stack. Some information is held on the top of the task's stack 1021(need_resched and preempt_count are both stored there). The 1022setting of the NEED_RESCHED bit is done directly to the task's 1023stack, but the reading of the NEED_RESCHED is done by looking at 1024the current stack, which in this case is the stack for the hard 1025interrupt. This hides the fact that NEED_RESCHED has been set. 1026We do not see the 'N' until we switch back to the task's 1027assigned stack. 1028 1029function 1030-------- 1031 1032This tracer is the function tracer. Enabling the function tracer 1033can be done from the debug file system. Make sure the 1034ftrace_enabled is set; otherwise this tracer is a nop. 1035 1036 # sysctl kernel.ftrace_enabled=1 1037 # echo function > current_tracer 1038 # echo 1 > tracing_on 1039 # usleep 1 1040 # echo 0 > tracing_on 1041 # cat trace 1042# tracer: function 1043# 1044# TASK-PID CPU# TIMESTAMP FUNCTION 1045# | | | | | 1046 bash-4003 [00] 123.638713: finish_task_switch <-schedule 1047 bash-4003 [00] 123.638714: _spin_unlock_irq <-finish_task_switch 1048 bash-4003 [00] 123.638714: sub_preempt_count <-_spin_unlock_irq 1049 bash-4003 [00] 123.638715: hrtick_set <-schedule 1050 bash-4003 [00] 123.638715: _spin_lock_irqsave <-hrtick_set 1051 bash-4003 [00] 123.638716: add_preempt_count <-_spin_lock_irqsave 1052 bash-4003 [00] 123.638716: _spin_unlock_irqrestore <-hrtick_set 1053 bash-4003 [00] 123.638717: sub_preempt_count <-_spin_unlock_irqrestore 1054 bash-4003 [00] 123.638717: hrtick_clear <-hrtick_set 1055 bash-4003 [00] 123.638718: sub_preempt_count <-schedule 1056 bash-4003 [00] 123.638718: sub_preempt_count <-preempt_schedule 1057 bash-4003 [00] 123.638719: wait_for_completion <-__stop_machine_run 1058 bash-4003 [00] 123.638719: wait_for_common <-wait_for_completion 1059 bash-4003 [00] 123.638720: _spin_lock_irq <-wait_for_common 1060 bash-4003 [00] 123.638720: add_preempt_count <-_spin_lock_irq 1061[...] 1062 1063 1064Note: function tracer uses ring buffers to store the above 1065entries. The newest data may overwrite the oldest data. 1066Sometimes using echo to stop the trace is not sufficient because 1067the tracing could have overwritten the data that you wanted to 1068record. For this reason, it is sometimes better to disable 1069tracing directly from a program. This allows you to stop the 1070tracing at the point that you hit the part that you are 1071interested in. To disable the tracing directly from a C program, 1072something like following code snippet can be used: 1073 1074int trace_fd; 1075[...] 1076int main(int argc, char *argv[]) { 1077 [...] 1078 trace_fd = open(tracing_file("tracing_on"), O_WRONLY); 1079 [...] 1080 if (condition_hit()) { 1081 write(trace_fd, "0", 1); 1082 } 1083 [...] 1084} 1085 1086 1087Single thread tracing 1088--------------------- 1089 1090By writing into set_ftrace_pid you can trace a 1091single thread. For example: 1092 1093# cat set_ftrace_pid 1094no pid 1095# echo 3111 > set_ftrace_pid 1096# cat set_ftrace_pid 10973111 1098# echo function > current_tracer 1099# cat trace | head 1100 # tracer: function 1101 # 1102 # TASK-PID CPU# TIMESTAMP FUNCTION 1103 # | | | | | 1104 yum-updatesd-3111 [003] 1637.254676: finish_task_switch <-thread_return 1105 yum-updatesd-3111 [003] 1637.254681: hrtimer_cancel <-schedule_hrtimeout_range 1106 yum-updatesd-3111 [003] 1637.254682: hrtimer_try_to_cancel <-hrtimer_cancel 1107 yum-updatesd-3111 [003] 1637.254683: lock_hrtimer_base <-hrtimer_try_to_cancel 1108 yum-updatesd-3111 [003] 1637.254685: fget_light <-do_sys_poll 1109 yum-updatesd-3111 [003] 1637.254686: pipe_poll <-do_sys_poll 1110# echo -1 > set_ftrace_pid 1111# cat trace |head 1112 # tracer: function 1113 # 1114 # TASK-PID CPU# TIMESTAMP FUNCTION 1115 # | | | | | 1116 ##### CPU 3 buffer started #### 1117 yum-updatesd-3111 [003] 1701.957688: free_poll_entry <-poll_freewait 1118 yum-updatesd-3111 [003] 1701.957689: remove_wait_queue <-free_poll_entry 1119 yum-updatesd-3111 [003] 1701.957691: fput <-free_poll_entry 1120 yum-updatesd-3111 [003] 1701.957692: audit_syscall_exit <-sysret_audit 1121 yum-updatesd-3111 [003] 1701.957693: path_put <-audit_syscall_exit 1122 1123If you want to trace a function when executing, you could use 1124something like this simple program: 1125 1126#include <stdio.h> 1127#include <stdlib.h> 1128#include <sys/types.h> 1129#include <sys/stat.h> 1130#include <fcntl.h> 1131#include <unistd.h> 1132#include <string.h> 1133 1134#define _STR(x) #x 1135#define STR(x) _STR(x) 1136#define MAX_PATH 256 1137 1138const char *find_debugfs(void) 1139{ 1140 static char debugfs[MAX_PATH+1]; 1141 static int debugfs_found; 1142 char type[100]; 1143 FILE *fp; 1144 1145 if (debugfs_found) 1146 return debugfs; 1147 1148 if ((fp = fopen("/proc/mounts","r")) == NULL) { 1149 perror("/proc/mounts"); 1150 return NULL; 1151 } 1152 1153 while (fscanf(fp, "%*s %" 1154 STR(MAX_PATH) 1155 "s %99s %*s %*d %*d\n", 1156 debugfs, type) == 2) { 1157 if (strcmp(type, "debugfs") == 0) 1158 break; 1159 } 1160 fclose(fp); 1161 1162 if (strcmp(type, "debugfs") != 0) { 1163 fprintf(stderr, "debugfs not mounted"); 1164 return NULL; 1165 } 1166 1167 strcat(debugfs, "/tracing/"); 1168 debugfs_found = 1; 1169 1170 return debugfs; 1171} 1172 1173const char *tracing_file(const char *file_name) 1174{ 1175 static char trace_file[MAX_PATH+1]; 1176 snprintf(trace_file, MAX_PATH, "%s/%s", find_debugfs(), file_name); 1177 return trace_file; 1178} 1179 1180int main (int argc, char **argv) 1181{ 1182 if (argc < 1) 1183 exit(-1); 1184 1185 if (fork() > 0) { 1186 int fd, ffd; 1187 char line[64]; 1188 int s; 1189 1190 ffd = open(tracing_file("current_tracer"), O_WRONLY); 1191 if (ffd < 0) 1192 exit(-1); 1193 write(ffd, "nop", 3); 1194 1195 fd = open(tracing_file("set_ftrace_pid"), O_WRONLY); 1196 s = sprintf(line, "%d\n", getpid()); 1197 write(fd, line, s); 1198 1199 write(ffd, "function", 8); 1200 1201 close(fd); 1202 close(ffd); 1203 1204 execvp(argv[1], argv+1); 1205 } 1206 1207 return 0; 1208} 1209 1210 1211hw-branch-tracer (x86 only) 1212--------------------------- 1213 1214This tracer uses the x86 last branch tracing hardware feature to 1215collect a branch trace on all cpus with relatively low overhead. 1216 1217The tracer uses a fixed-size circular buffer per cpu and only 1218traces ring 0 branches. The trace file dumps that buffer in the 1219following format: 1220 1221# tracer: hw-branch-tracer 1222# 1223# CPU# TO <- FROM 1224 0 scheduler_tick+0xb5/0x1bf <- task_tick_idle+0x5/0x6 1225 2 run_posix_cpu_timers+0x2b/0x72a <- run_posix_cpu_timers+0x25/0x72a 1226 0 scheduler_tick+0x139/0x1bf <- scheduler_tick+0xed/0x1bf 1227 0 scheduler_tick+0x17c/0x1bf <- scheduler_tick+0x148/0x1bf 1228 2 run_posix_cpu_timers+0x9e/0x72a <- run_posix_cpu_timers+0x5e/0x72a 1229 0 scheduler_tick+0x1b6/0x1bf <- scheduler_tick+0x1aa/0x1bf 1230 1231 1232The tracer may be used to dump the trace for the oops'ing cpu on 1233a kernel oops into the system log. To enable this, 1234ftrace_dump_on_oops must be set. To set ftrace_dump_on_oops, one 1235can either use the sysctl function or set it via the proc system 1236interface. 1237 1238 sysctl kernel.ftrace_dump_on_oops=n 1239 1240or 1241 1242 echo n > /proc/sys/kernel/ftrace_dump_on_oops 1243 1244If n = 1, ftrace will dump buffers of all CPUs, if n = 2 ftrace will 1245only dump the buffer of the CPU that triggered the oops. 1246 1247Here's an example of such a dump after a null pointer 1248dereference in a kernel module: 1249 1250[57848.105921] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 1251[57848.106019] IP: [<ffffffffa0000006>] open+0x6/0x14 [oops] 1252[57848.106019] PGD 2354e9067 PUD 2375e7067 PMD 0 1253[57848.106019] Oops: 0002 [#1] SMP 1254[57848.106019] last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:20:05.0/local_cpus 1255[57848.106019] Dumping ftrace buffer: 1256[57848.106019] --------------------------------- 1257[...] 1258[57848.106019] 0 chrdev_open+0xe6/0x165 <- cdev_put+0x23/0x24 1259[57848.106019] 0 chrdev_open+0x117/0x165 <- chrdev_open+0xfa/0x165 1260[57848.106019] 0 chrdev_open+0x120/0x165 <- chrdev_open+0x11c/0x165 1261[57848.106019] 0 chrdev_open+0x134/0x165 <- chrdev_open+0x12b/0x165 1262[57848.106019] 0 open+0x0/0x14 [oops] <- chrdev_open+0x144/0x165 1263[57848.106019] 0 page_fault+0x0/0x30 <- open+0x6/0x14 [oops] 1264[57848.106019] 0 error_entry+0x0/0x5b <- page_fault+0x4/0x30 1265[57848.106019] 0 error_kernelspace+0x0/0x31 <- error_entry+0x59/0x5b 1266[57848.106019] 0 error_sti+0x0/0x1 <- error_kernelspace+0x2d/0x31 1267[57848.106019] 0 page_fault+0x9/0x30 <- error_sti+0x0/0x1 1268[57848.106019] 0 do_page_fault+0x0/0x881 <- page_fault+0x1a/0x30 1269[...] 1270[57848.106019] 0 do_page_fault+0x66b/0x881 <- is_prefetch+0x1ee/0x1f2 1271[57848.106019] 0 do_page_fault+0x6e0/0x881 <- do_page_fault+0x67a/0x881 1272[57848.106019] 0 oops_begin+0x0/0x96 <- do_page_fault+0x6e0/0x881 1273[57848.106019] 0 trace_hw_branch_oops+0x0/0x2d <- oops_begin+0x9/0x96 1274[...] 1275[57848.106019] 0 ds_suspend_bts+0x2a/0xe3 <- ds_suspend_bts+0x1a/0xe3 1276[57848.106019] --------------------------------- 1277[57848.106019] CPU 0 1278[57848.106019] Modules linked in: oops 1279[57848.106019] Pid: 5542, comm: cat Tainted: G W 2.6.28 #23 1280[57848.106019] RIP: 0010:[<ffffffffa0000006>] [<ffffffffa0000006>] open+0x6/0x14 [oops] 1281[57848.106019] RSP: 0018:ffff880235457d48 EFLAGS: 00010246 1282[...] 1283 1284 1285function graph tracer 1286--------------------------- 1287 1288This tracer is similar to the function tracer except that it 1289probes a function on its entry and its exit. This is done by 1290using a dynamically allocated stack of return addresses in each 1291task_struct. On function entry the tracer overwrites the return 1292address of each function traced to set a custom probe. Thus the 1293original return address is stored on the stack of return address 1294in the task_struct. 1295 1296Probing on both ends of a function leads to special features 1297such as: 1298 1299- measure of a function's time execution 1300- having a reliable call stack to draw function calls graph 1301 1302This tracer is useful in several situations: 1303 1304- you want to find the reason of a strange kernel behavior and 1305 need to see what happens in detail on any areas (or specific 1306 ones). 1307 1308- you are experiencing weird latencies but it's difficult to 1309 find its origin. 1310 1311- you want to find quickly which path is taken by a specific 1312 function 1313 1314- you just want to peek inside a working kernel and want to see 1315 what happens there. 1316 1317# tracer: function_graph 1318# 1319# CPU DURATION FUNCTION CALLS 1320# | | | | | | | 1321 1322 0) | sys_open() { 1323 0) | do_sys_open() { 1324 0) | getname() { 1325 0) | kmem_cache_alloc() { 1326 0) 1.382 us | __might_sleep(); 1327 0) 2.478 us | } 1328 0) | strncpy_from_user() { 1329 0) | might_fault() { 1330 0) 1.389 us | __might_sleep(); 1331 0) 2.553 us | } 1332 0) 3.807 us | } 1333 0) 7.876 us | } 1334 0) | alloc_fd() { 1335 0) 0.668 us | _spin_lock(); 1336 0) 0.570 us | expand_files(); 1337 0) 0.586 us | _spin_unlock(); 1338 1339 1340There are several columns that can be dynamically 1341enabled/disabled. You can use every combination of options you 1342want, depending on your needs. 1343 1344- The cpu number on which the function executed is default 1345 enabled. It is sometimes better to only trace one cpu (see 1346 tracing_cpu_mask file) or you might sometimes see unordered 1347 function calls while cpu tracing switch. 1348 1349 hide: echo nofuncgraph-cpu > trace_options 1350 show: echo funcgraph-cpu > trace_options 1351 1352- The duration (function's time of execution) is displayed on 1353 the closing bracket line of a function or on the same line 1354 than the current function in case of a leaf one. It is default 1355 enabled. 1356 1357 hide: echo nofuncgraph-duration > trace_options 1358 show: echo funcgraph-duration > trace_options 1359 1360- The overhead field precedes the duration field in case of 1361 reached duration thresholds. 1362 1363 hide: echo nofuncgraph-overhead > trace_options 1364 show: echo funcgraph-overhead > trace_options 1365 depends on: funcgraph-duration 1366 1367 ie: 1368 1369 0) | up_write() { 1370 0) 0.646 us | _spin_lock_irqsave(); 1371 0) 0.684 us | _spin_unlock_irqrestore(); 1372 0) 3.123 us | } 1373 0) 0.548 us | fput(); 1374 0) + 58.628 us | } 1375 1376 [...] 1377 1378 0) | putname() { 1379 0) | kmem_cache_free() { 1380 0) 0.518 us | __phys_addr(); 1381 0) 1.757 us | } 1382 0) 2.861 us | } 1383 0) ! 115.305 us | } 1384 0) ! 116.402 us | } 1385 1386 + means that the function exceeded 10 usecs. 1387 ! means that the function exceeded 100 usecs. 1388 1389 1390- The task/pid field displays the thread cmdline and pid which 1391 executed the function. It is default disabled. 1392 1393 hide: echo nofuncgraph-proc > trace_options 1394 show: echo funcgraph-proc > trace_options 1395 1396 ie: 1397 1398 # tracer: function_graph 1399 # 1400 # CPU TASK/PID DURATION FUNCTION CALLS 1401 # | | | | | | | | | 1402 0) sh-4802 | | d_free() { 1403 0) sh-4802 | | call_rcu() { 1404 0) sh-4802 | | __call_rcu() { 1405 0) sh-4802 | 0.616 us | rcu_process_gp_end(); 1406 0) sh-4802 | 0.586 us | check_for_new_grace_period(); 1407 0) sh-4802 | 2.899 us | } 1408 0) sh-4802 | 4.040 us | } 1409 0) sh-4802 | 5.151 us | } 1410 0) sh-4802 | + 49.370 us | } 1411 1412 1413- The absolute time field is an absolute timestamp given by the 1414 system clock since it started. A snapshot of this time is 1415 given on each entry/exit of functions 1416 1417 hide: echo nofuncgraph-abstime > trace_options 1418 show: echo funcgraph-abstime > trace_options 1419 1420 ie: 1421 1422 # 1423 # TIME CPU DURATION FUNCTION CALLS 1424 # | | | | | | | | 1425 360.774522 | 1) 0.541 us | } 1426 360.774522 | 1) 4.663 us | } 1427 360.774523 | 1) 0.541 us | __wake_up_bit(); 1428 360.774524 | 1) 6.796 us | } 1429 360.774524 | 1) 7.952 us | } 1430 360.774525 | 1) 9.063 us | } 1431 360.774525 | 1) 0.615 us | journal_mark_dirty(); 1432 360.774527 | 1) 0.578 us | __brelse(); 1433 360.774528 | 1) | reiserfs_prepare_for_journal() { 1434 360.774528 | 1) | unlock_buffer() { 1435 360.774529 | 1) | wake_up_bit() { 1436 360.774529 | 1) | bit_waitqueue() { 1437 360.774530 | 1) 0.594 us | __phys_addr(); 1438 1439 1440You can put some comments on specific functions by using 1441trace_printk() For example, if you want to put a comment inside 1442the __might_sleep() function, you just have to include 1443<linux/ftrace.h> and call trace_printk() inside __might_sleep() 1444 1445trace_printk("I'm a comment!\n") 1446 1447will produce: 1448 1449 1) | __might_sleep() { 1450 1) | /* I'm a comment! */ 1451 1) 1.449 us | } 1452 1453 1454You might find other useful features for this tracer in the 1455following "dynamic ftrace" section such as tracing only specific 1456functions or tasks. 1457 1458dynamic ftrace 1459-------------- 1460 1461If CONFIG_DYNAMIC_FTRACE is set, the system will run with 1462virtually no overhead when function tracing is disabled. The way 1463this works is the mcount function call (placed at the start of 1464every kernel function, produced by the -pg switch in gcc), 1465starts of pointing to a simple return. (Enabling FTRACE will 1466include the -pg switch in the compiling of the kernel.) 1467 1468At compile time every C file object is run through the 1469recordmcount.pl script (located in the scripts directory). This 1470script will process the C object using objdump to find all the 1471locations in the .text section that call mcount. (Note, only the 1472.text section is processed, since processing other sections like 1473.init.text may cause races due to those sections being freed). 1474 1475A new section called "__mcount_loc" is created that holds 1476references to all the mcount call sites in the .text section. 1477This section is compiled back into the original object. The 1478final linker will add all these references into a single table. 1479 1480On boot up, before SMP is initialized, the dynamic ftrace code 1481scans this table and updates all the locations into nops. It 1482also records the locations, which are added to the 1483available_filter_functions list. Modules are processed as they 1484are loaded and before they are executed. When a module is 1485unloaded, it also removes its functions from the ftrace function 1486list. This is automatic in the module unload code, and the 1487module author does not need to worry about it. 1488 1489When tracing is enabled, kstop_machine is called to prevent 1490races with the CPUS executing code being modified (which can 1491cause the CPU to do undesirable things), and the nops are 1492patched back to calls. But this time, they do not call mcount 1493(which is just a function stub). They now call into the ftrace 1494infrastructure. 1495 1496One special side-effect to the recording of the functions being 1497traced is that we can now selectively choose which functions we 1498wish to trace and which ones we want the mcount calls to remain 1499as nops. 1500 1501Two files are used, one for enabling and one for disabling the 1502tracing of specified functions. They are: 1503 1504 set_ftrace_filter 1505 1506and 1507 1508 set_ftrace_notrace 1509 1510A list of available functions that you can add to these files is 1511listed in: 1512 1513 available_filter_functions 1514 1515 # cat available_filter_functions 1516put_prev_task_idle 1517kmem_cache_create 1518pick_next_task_rt 1519get_online_cpus 1520pick_next_task_fair 1521mutex_lock 1522[...] 1523 1524If I am only interested in sys_nanosleep and hrtimer_interrupt: 1525 1526 # echo sys_nanosleep hrtimer_interrupt \ 1527 > set_ftrace_filter 1528 # echo function > current_tracer 1529 # echo 1 > tracing_on 1530 # usleep 1 1531 # echo 0 > tracing_on 1532 # cat trace 1533# tracer: ftrace 1534# 1535# TASK-PID CPU# TIMESTAMP FUNCTION 1536# | | | | | 1537 usleep-4134 [00] 1317.070017: hrtimer_interrupt <-smp_apic_timer_interrupt 1538 usleep-4134 [00] 1317.070111: sys_nanosleep <-syscall_call 1539 <idle>-0 [00] 1317.070115: hrtimer_interrupt <-smp_apic_timer_interrupt 1540 1541To see which functions are being traced, you can cat the file: 1542 1543 # cat set_ftrace_filter 1544hrtimer_interrupt 1545sys_nanosleep 1546 1547 1548Perhaps this is not enough. The filters also allow simple wild 1549cards. Only the following are currently available 1550 1551 <match>* - will match functions that begin with <match> 1552 *<match> - will match functions that end with <match> 1553 *<match>* - will match functions that have <match> in it 1554 1555These are the only wild cards which are supported. 1556 1557 <match>*<match> will not work. 1558 1559Note: It is better to use quotes to enclose the wild cards, 1560 otherwise the shell may expand the parameters into names 1561 of files in the local directory. 1562 1563 # echo 'hrtimer_*' > set_ftrace_filter 1564 1565Produces: 1566 1567# tracer: ftrace 1568# 1569# TASK-PID CPU# TIMESTAMP FUNCTION 1570# | | | | | 1571 bash-4003 [00] 1480.611794: hrtimer_init <-copy_process 1572 bash-4003 [00] 1480.611941: hrtimer_start <-hrtick_set 1573 bash-4003 [00] 1480.611956: hrtimer_cancel <-hrtick_clear 1574 bash-4003 [00] 1480.611956: hrtimer_try_to_cancel <-hrtimer_cancel 1575 <idle>-0 [00] 1480.612019: hrtimer_get_next_event <-get_next_timer_interrupt 1576 <idle>-0 [00] 1480.612025: hrtimer_get_next_event <-get_next_timer_interrupt 1577 <idle>-0 [00] 1480.612032: hrtimer_get_next_event <-get_next_timer_interrupt 1578 <idle>-0 [00] 1480.612037: hrtimer_get_next_event <-get_next_timer_interrupt 1579 <idle>-0 [00] 1480.612382: hrtimer_get_next_event <-get_next_timer_interrupt 1580 1581 1582Notice that we lost the sys_nanosleep. 1583 1584 # cat set_ftrace_filter 1585hrtimer_run_queues 1586hrtimer_run_pending 1587hrtimer_init 1588hrtimer_cancel 1589hrtimer_try_to_cancel 1590hrtimer_forward 1591hrtimer_start 1592hrtimer_reprogram 1593hrtimer_force_reprogram 1594hrtimer_get_next_event 1595hrtimer_interrupt 1596hrtimer_nanosleep 1597hrtimer_wakeup 1598hrtimer_get_remaining 1599hrtimer_get_res 1600hrtimer_init_sleeper 1601 1602 1603This is because the '>' and '>>' act just like they do in bash. 1604To rewrite the filters, use '>' 1605To append to the filters, use '>>' 1606 1607To clear out a filter so that all functions will be recorded 1608again: 1609 1610 # echo > set_ftrace_filter 1611 # cat set_ftrace_filter 1612 # 1613 1614Again, now we want to append. 1615 1616 # echo sys_nanosleep > set_ftrace_filter 1617 # cat set_ftrace_filter 1618sys_nanosleep 1619 # echo 'hrtimer_*' >> set_ftrace_filter 1620 # cat set_ftrace_filter 1621hrtimer_run_queues 1622hrtimer_run_pending 1623hrtimer_init 1624hrtimer_cancel 1625hrtimer_try_to_cancel 1626hrtimer_forward 1627hrtimer_start 1628hrtimer_reprogram 1629hrtimer_force_reprogram 1630hrtimer_get_next_event 1631hrtimer_interrupt 1632sys_nanosleep 1633hrtimer_nanosleep 1634hrtimer_wakeup 1635hrtimer_get_remaining 1636hrtimer_get_res 1637hrtimer_init_sleeper 1638 1639 1640The set_ftrace_notrace prevents those functions from being 1641traced. 1642 1643 # echo '*preempt*' '*lock*' > set_ftrace_notrace 1644 1645Produces: 1646 1647# tracer: ftrace 1648# 1649# TASK-PID CPU# TIMESTAMP FUNCTION 1650# | | | | | 1651 bash-4043 [01] 115.281644: finish_task_switch <-schedule 1652 bash-4043 [01] 115.281645: hrtick_set <-schedule 1653 bash-4043 [01] 115.281645: hrtick_clear <-hrtick_set 1654 bash-4043 [01] 115.281646: wait_for_completion <-__stop_machine_run 1655 bash-4043 [01] 115.281647: wait_for_common <-wait_for_completion 1656 bash-4043 [01] 115.281647: kthread_stop <-stop_machine_run 1657 bash-4043 [01] 115.281648: init_waitqueue_head <-kthread_stop 1658 bash-4043 [01] 115.281648: wake_up_process <-kthread_stop 1659 bash-4043 [01] 115.281649: try_to_wake_up <-wake_up_process 1660 1661We can see that there's no more lock or preempt tracing. 1662 1663 1664Dynamic ftrace with the function graph tracer 1665--------------------------------------------- 1666 1667Although what has been explained above concerns both the 1668function tracer and the function-graph-tracer, there are some 1669special features only available in the function-graph tracer. 1670 1671If you want to trace only one function and all of its children, 1672you just have to echo its name into set_graph_function: 1673 1674 echo __do_fault > set_graph_function 1675 1676will produce the following "expanded" trace of the __do_fault() 1677function: 1678 1679 0) | __do_fault() { 1680 0) | filemap_fault() { 1681 0) | find_lock_page() { 1682 0) 0.804 us | find_get_page(); 1683 0) | __might_sleep() { 1684 0) 1.329 us | } 1685 0) 3.904 us | } 1686 0) 4.979 us | } 1687 0) 0.653 us | _spin_lock(); 1688 0) 0.578 us | page_add_file_rmap(); 1689 0) 0.525 us | native_set_pte_at(); 1690 0) 0.585 us | _spin_unlock(); 1691 0) | unlock_page() { 1692 0) 0.541 us | page_waitqueue(); 1693 0) 0.639 us | __wake_up_bit(); 1694 0) 2.786 us | } 1695 0) + 14.237 us | } 1696 0) | __do_fault() { 1697 0) | filemap_fault() { 1698 0) | find_lock_page() { 1699 0) 0.698 us | find_get_page(); 1700 0) | __might_sleep() { 1701 0) 1.412 us | } 1702 0) 3.950 us | } 1703 0) 5.098 us | } 1704 0) 0.631 us | _spin_lock(); 1705 0) 0.571 us | page_add_file_rmap(); 1706 0) 0.526 us | native_set_pte_at(); 1707 0) 0.586 us | _spin_unlock(); 1708 0) | unlock_page() { 1709 0) 0.533 us | page_waitqueue(); 1710 0) 0.638 us | __wake_up_bit(); 1711 0) 2.793 us | } 1712 0) + 14.012 us | } 1713 1714You can also expand several functions at once: 1715 1716 echo sys_open > set_graph_function 1717 echo sys_close >> set_graph_function 1718 1719Now if you want to go back to trace all functions you can clear 1720this special filter via: 1721 1722 echo > set_graph_function 1723 1724 1725Filter commands 1726--------------- 1727 1728A few commands are supported by the set_ftrace_filter interface. 1729Trace commands have the following format: 1730 1731<function>:<command>:<parameter> 1732 1733The following commands are supported: 1734 1735- mod 1736 This command enables function filtering per module. The 1737 parameter defines the module. For example, if only the write* 1738 functions in the ext3 module are desired, run: 1739 1740 echo 'write*:mod:ext3' > set_ftrace_filter 1741 1742 This command interacts with the filter in the same way as 1743 filtering based on function names. Thus, adding more functions 1744 in a different module is accomplished by appending (>>) to the 1745 filter file. Remove specific module functions by prepending 1746 '!': 1747 1748 echo '!writeback*:mod:ext3' >> set_ftrace_filter 1749 1750- traceon/traceoff 1751 These commands turn tracing on and off when the specified 1752 functions are hit. The parameter determines how many times the 1753 tracing system is turned on and off. If unspecified, there is 1754 no limit. For example, to disable tracing when a schedule bug 1755 is hit the first 5 times, run: 1756 1757 echo '__schedule_bug:traceoff:5' > set_ftrace_filter 1758 1759 These commands are cumulative whether or not they are appended 1760 to set_ftrace_filter. To remove a command, prepend it by '!' 1761 and drop the parameter: 1762 1763 echo '!__schedule_bug:traceoff' > set_ftrace_filter 1764 1765 1766trace_pipe 1767---------- 1768 1769The trace_pipe outputs the same content as the trace file, but 1770the effect on the tracing is different. Every read from 1771trace_pipe is consumed. This means that subsequent reads will be 1772different. The trace is live. 1773 1774 # echo function > current_tracer 1775 # cat trace_pipe > /tmp/trace.out & 1776[1] 4153 1777 # echo 1 > tracing_on 1778 # usleep 1 1779 # echo 0 > tracing_on 1780 # cat trace 1781# tracer: function 1782# 1783# TASK-PID CPU# TIMESTAMP FUNCTION 1784# | | | | | 1785 1786 # 1787 # cat /tmp/trace.out 1788 bash-4043 [00] 41.267106: finish_task_switch <-schedule 1789 bash-4043 [00] 41.267106: hrtick_set <-schedule 1790 bash-4043 [00] 41.267107: hrtick_clear <-hrtick_set 1791 bash-4043 [00] 41.267108: wait_for_completion <-__stop_machine_run 1792 bash-4043 [00] 41.267108: wait_for_common <-wait_for_completion 1793 bash-4043 [00] 41.267109: kthread_stop <-stop_machine_run 1794 bash-4043 [00] 41.267109: init_waitqueue_head <-kthread_stop 1795 bash-4043 [00] 41.267110: wake_up_process <-kthread_stop 1796 bash-4043 [00] 41.267110: try_to_wake_up <-wake_up_process 1797 bash-4043 [00] 41.267111: select_task_rq_rt <-try_to_wake_up 1798 1799 1800Note, reading the trace_pipe file will block until more input is 1801added. By changing the tracer, trace_pipe will issue an EOF. We 1802needed to set the function tracer _before_ we "cat" the 1803trace_pipe file. 1804 1805 1806trace entries 1807------------- 1808 1809Having too much or not enough data can be troublesome in 1810diagnosing an issue in the kernel. The file buffer_size_kb is 1811used to modify the size of the internal trace buffers. The 1812number listed is the number of entries that can be recorded per 1813CPU. To know the full size, multiply the number of possible CPUS 1814with the number of entries. 1815 1816 # cat buffer_size_kb 18171408 (units kilobytes) 1818 1819Note, to modify this, you must have tracing completely disabled. 1820To do that, echo "nop" into the current_tracer. If the 1821current_tracer is not set to "nop", an EINVAL error will be 1822returned. 1823 1824 # echo nop > current_tracer 1825 # echo 10000 > buffer_size_kb 1826 # cat buffer_size_kb 182710000 (units kilobytes) 1828 1829The number of pages which will be allocated is limited to a 1830percentage of available memory. Allocating too much will produce 1831an error. 1832 1833 # echo 1000000000000 > buffer_size_kb 1834-bash: echo: write error: Cannot allocate memory 1835 # cat buffer_size_kb 183685 1837 1838----------- 1839 1840More details can be found in the source code, in the 1841kernel/trace/*.c files. 1842