=============
BPF Iterators
=============

--------
Overview
--------

BPF supports two separate entities collectively known as "BPF iterators": BPF
iterator *program type* and *open-coded* BPF iterators. The former is
a stand-alone BPF program type which, when attached and activated by the user,
will be called once for each entity (task_struct, cgroup, etc.) that is being
iterated. The latter is a set of BPF-side APIs implementing iterator
functionality and available across multiple BPF program types. Open-coded
iterators provide similar functionality to BPF iterator programs, but give
more flexibility and control to all other BPF program types. BPF iterator
programs, on the other hand, can be used to implement anonymous or BPF
FS-mounted special files, whose contents are generated by the attached BPF
iterator program, backed by seq_file functionality. Both are useful depending
on specific needs.

When adding a new BPF iterator program, it is expected that similar
functionality will be added as an open-coded iterator for maximum flexibility.
It's also expected that iteration logic and code will be maximally shared and
reused between the two iterator API surfaces.

------------------------
Open-coded BPF Iterators
------------------------

Open-coded BPF iterators are implemented as tightly-coupled trios of kfuncs
(constructor, next element fetch, destructor) and an iterator-specific type
describing on-the-stack iterator state, which is guaranteed by the BPF
verifier to not be tampered with outside of the corresponding
constructor/destructor/next APIs.

Each kind of open-coded BPF iterator has its own associated
struct bpf_iter_<type>, where <type> denotes a specific type of iterator.
bpf_iter_<type> state needs to live on BPF program stack, so make sure it's
small enough to fit on BPF stack.
For performance reasons it's best to avoid
dynamic memory allocation for iterator state and size the state struct big
enough to fit everything necessary. But if necessary, dynamic memory
allocation is a way to bypass BPF stack limitations. Note, state struct size
is part of iterator's user-visible API, so changing it will break backwards
compatibility, so be deliberate about designing it.

All kfuncs (constructor, next, destructor) have to be named consistently as
bpf_iter_<type>_{new,next,destroy}(), respectively. <type> represents iterator
type, and iterator state should be represented as a matching
`struct bpf_iter_<type>` state type. Also, all iter kfuncs should have
a pointer to this `struct bpf_iter_<type>` as the very first argument.

Additionally:
  - Constructor, i.e., `bpf_iter_<type>_new()`, can have an arbitrary number
    of extra arguments. Return type is not enforced either.
  - Next method, i.e., `bpf_iter_<type>_next()`, has to return a pointer
    type and should have exactly one argument: `struct bpf_iter_<type> *`
    (const/volatile/restrict and typedefs are ignored).
  - Destructor, i.e., `bpf_iter_<type>_destroy()`, should return void and
    should have exactly one argument, similar to the next method.
  - `struct bpf_iter_<type>` size is enforced to be positive and
    a multiple of 8 bytes (to fit stack slots correctly).

Such strictness and consistency allow building generic helpers that abstract
important, but boilerplate, details to be able to use open-coded iterators
effectively and ergonomically (see libbpf's bpf_for_each() macro). This is
enforced at kfunc registration point by the kernel.

Constructor/next/destructor implementation contract is as follows:
  - constructor, `bpf_iter_<type>_new()`, always initializes iterator state on
    the stack. If any of the input arguments are invalid, constructor should
    make sure to still initialize it such that subsequent next() calls will
    return NULL.
    I.e., on error, *return error and construct empty iterator*.
    Constructor kfunc is marked with KF_ITER_NEW flag.

  - next method, `bpf_iter_<type>_next()`, accepts pointer to iterator state
    and produces an element. Next method should always return a pointer. The
    contract with the BPF verifier is that next method *guarantees* that it
    will eventually return NULL when elements are exhausted. Once NULL is
    returned, subsequent next calls *should keep returning NULL*. Next method
    is marked with KF_ITER_NEXT (and should also have KF_RET_NULL as
    NULL-returning kfunc, of course).

  - destructor, `bpf_iter_<type>_destroy()`, is always called once, even if
    the constructor failed or next returned nothing. Destructor frees up any
    resources and marks stack space used by `struct bpf_iter_<type>` as usable
    for something else. Destructor is marked with KF_ITER_DESTROY flag.

Any open-coded BPF iterator implementation has to implement at least these
three methods. It is enforced that for any given type of iterator only
applicable constructor/destructor/next are callable. I.e., verifier ensures
you can't pass number iterator state into, say, cgroup iterator's next method.

From a 10,000-foot BPF verification point of view, next methods are the points
of forking a verification state, which are conceptually similar to what
verifier is doing when validating conditional jumps. Verifier is branching out
`call bpf_iter_<type>_next` instruction and simulates two outcomes: NULL
(iteration is done) and non-NULL (new element is returned). NULL is simulated
first and is supposed to reach exit without looping. After that non-NULL case
is validated and it either reaches exit (for trivial examples with no real
loop), or reaches another `call bpf_iter_<type>_next` instruction with the
state equivalent to already (partially) validated one.
State equivalency at
that point means we technically are going to be looping forever without
"breaking out" of the established "state envelope" (i.e., subsequent
iterations don't add any new knowledge or constraints to the verifier state,
so running 1, 2, 10, or a million of them doesn't matter). But taking into
account the contract stating that iterator next method *has to* return NULL
eventually, we can conclude that loop body is safe and will eventually
terminate. Given we validated logic outside of the loop (NULL case), and
concluded that loop body is safe (though potentially looping many times),
verifier can claim safety of the overall program logic.

------------------------
BPF Iterators Motivation
------------------------

There are a few existing ways to dump kernel data into user space. The most
popular one is the ``/proc`` system. For example, ``cat /proc/net/tcp6`` dumps
all tcp6 sockets in the system, and ``cat /proc/net/netlink`` dumps all netlink
sockets in the system. However, their output format tends to be fixed, and if
users want more information about these sockets, they have to patch the kernel,
which often takes time to publish upstream and release. The same is true for popular
tools like `ss <https://man7.org/linux/man-pages/man8/ss.8.html>`_ where any
additional information needs a kernel patch.

To solve this problem, the `drgn
<https://www.kernel.org/doc/html/latest/bpf/drgn.html>`_ tool is often used to
dig out the kernel data with no kernel change. However, the main drawback for
drgn is performance, as it cannot do pointer tracing inside the kernel. In
addition, drgn cannot validate a pointer value and may read invalid data if the
pointer becomes invalid inside the kernel.

The BPF iterator solves the above problem by providing flexibility on what data
(e.g., tasks, bpf_maps, etc.)
to collect by calling BPF programs for each kernel
data object.

----------------------
How BPF Iterators Work
----------------------

A BPF iterator is a type of BPF program that allows users to iterate over
specific types of kernel objects. Unlike traditional BPF tracing programs that
allow users to define callbacks that are invoked at particular points of
execution in the kernel, BPF iterators allow users to define callbacks that
should be executed for every entry in a variety of kernel data structures.

For example, users can define a BPF iterator that iterates over every task on
the system and dumps the total amount of CPU runtime currently used by each of
them. Another BPF task iterator may instead dump the cgroup information for each
task. Such flexibility is the core value of BPF iterators.

A BPF program is always loaded into the kernel at the behest of a user space
process. A user space process loads a BPF program by opening and initializing
the program skeleton as required and then invoking a syscall to have the BPF
program verified and loaded by the kernel.

In traditional tracing programs, a program is activated by having user space
obtain a ``bpf_link`` to the program with ``bpf_program__attach()``. Once
activated, the program callback will be invoked whenever the tracepoint is
triggered in the main kernel. For BPF iterator programs, a ``bpf_link`` to the
program is obtained using ``bpf_link_create()``, and the program callback is
invoked by issuing system calls from user space.

Next, let us see how you can use the iterators to iterate on kernel objects and
read data.

------------------------
How to Use BPF iterators
------------------------

BPF selftests are a great resource to illustrate how to use the iterators. In
this section, we’ll walk through a BPF selftest which shows how to load and use
a BPF iterator program.
To begin, we’ll look at `bpf_iter.c
<https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/prog_tests/bpf_iter.c>`_,
which illustrates how to load and trigger BPF iterators on the user space side.
Later, we’ll look at a BPF program that runs in kernel space.

Loading a BPF iterator in the kernel from user space typically involves the
following steps:

* The BPF program is loaded into the kernel through ``libbpf``. Once the kernel
  has verified and loaded the program, it returns a file descriptor (fd) to user
  space.
* Obtain a ``link_fd`` to the BPF program by calling ``bpf_link_create()``
  with the BPF program file descriptor received from the kernel.
* Next, obtain a BPF iterator file descriptor (``bpf_iter_fd``) by calling
  ``bpf_iter_create()`` with the ``link_fd`` obtained in the previous step.
* Trigger the iteration by calling ``read(bpf_iter_fd)`` until no data is
  available.
* Close the iterator fd using ``close(bpf_iter_fd)``.
* To reread the data, get a new ``bpf_iter_fd`` and do the read again.

The following are a few examples of selftest BPF iterator programs:

* `bpf_iter_tcp4.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_tcp4.c>`_
* `bpf_iter_task_vmas.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_vmas.c>`_
* `bpf_iter_task_file.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_file.c>`_

Let us look at ``bpf_iter_task_file.c``, which runs in kernel space:

Here is the definition of ``bpf_iter__task_file`` in `vmlinux.h
<https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html#btf>`_.
Any struct name in ``vmlinux.h`` in the format ``bpf_iter__<iter_name>``
represents a BPF iterator. The suffix ``<iter_name>`` represents the type of
iterator.

::

  struct bpf_iter__task_file {
      union {
          struct bpf_iter_meta *meta;
      };
      union {
          struct task_struct *task;
      };
      u32 fd;
      union {
          struct file *file;
      };
  };

In the above code, the field 'meta' contains the metadata, which is the same for
all BPF iterator programs. The rest of the fields are specific to different
iterators. For example, for task_file iterators, the kernel layer provides the
'task', 'fd' and 'file' field values. The 'task' and 'file' are `reference
counted
<https://facebookmicrosites.github.io/bpf/blog/2018/08/31/object-lifetime.html#file-descriptors-and-reference-counters>`_,
so they won't go away while the BPF program runs.

Here is a snippet from the ``bpf_iter_task_file.c`` file:

::

  /* count, tgid, last_tgid and unique_tgid_count are globals defined
   * elsewhere in the selftest source.
   */
  SEC("iter/task_file")
  int dump_task_file(struct bpf_iter__task_file *ctx)
  {
      struct seq_file *seq = ctx->meta->seq;
      struct task_struct *task = ctx->task;
      struct file *file = ctx->file;
      __u32 fd = ctx->fd;

      if (task == NULL || file == NULL)
          return 0;

      /* First invocation of this iteration: print the header once. */
      if (ctx->meta->seq_num == 0) {
          count = 0;
          BPF_SEQ_PRINTF(seq, "    tgid      gid       fd      file\n");
      }

      if (tgid == task->tgid && task->tgid != task->pid)
          count++;

      if (last_tgid != task->tgid) {
          last_tgid = task->tgid;
          unique_tgid_count++;
      }

      BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
                     (long)file->f_op);
      return 0;
  }

In the above example, the section name ``SEC("iter/task_file")`` indicates that
the program is a BPF iterator program to iterate all files from all tasks. The
context of the program is the ``bpf_iter__task_file`` struct.
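
The user-space steps listed earlier end with reading the iterator fd until no
data is available. The following is a minimal sketch of that read loop in plain
C, assuming nothing beyond POSIX ``read()``. ``drain_fd()`` is a hypothetical
helper name used only for illustration; it is not part of libbpf or the
selftests.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * drain_fd() is a hypothetical helper, not a libbpf API.
 *
 * It reads from fd until read() returns 0 (no more data). At most cap - 1
 * bytes are stored in out, which is always NUL-terminated; anything beyond
 * that is read and discarded so the loop still runs to the end of the data.
 * Returns the total number of bytes read, or -1 on a read error.
 */
static ssize_t drain_fd(int fd, char *out, size_t cap)
{
    char chunk[256];
    size_t stored = 0;
    ssize_t total = 0;

    assert(out != NULL && cap > 0);

    for (;;) {
        ssize_t len = read(fd, chunk, sizeof(chunk));

        if (len < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            out[stored] = '\0';
            return -1;
        }
        if (len == 0)
            break;              /* no more data */

        /* Copy only as much as still fits, leaving room for the NUL. */
        size_t room = cap - 1 - stored;
        size_t take = (size_t)len < room ? (size_t)len : room;

        memcpy(out + stored, chunk, take);
        stored += take;
        total += len;
    }

    out[stored] = '\0';
    return total;
}
```

The helper only needs a readable file descriptor, so the same loop should apply
whether the fd comes from ``bpf_iter_create()`` or from ``open()`` on an
iterator pinned in bpffs.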

The user space program invokes the BPF iterator program running in the kernel
by issuing a ``read()`` syscall. Once invoked, the BPF
program can export data to user space using a variety of BPF helper functions.
You can use either ``bpf_seq_printf()`` (and BPF_SEQ_PRINTF helper macro) or
``bpf_seq_write()`` function based on whether you need formatted output or just
binary data, respectively. For binary-encoded data, the user space applications
can process the data from ``bpf_seq_write()`` as needed. For the formatted data,
you can use ``cat <path>`` to print the results similar to ``cat
/proc/net/netlink`` after pinning the BPF iterator to the bpffs mount. Later,
use ``rm -f <path>`` to remove the pinned iterator.

For example, you can use the following command to create a BPF iterator from the
``bpf_iter_ipv6_route.o`` object file and pin it to the ``/sys/fs/bpf/my_route``
path:

::

  $ bpftool iter pin ./bpf_iter_ipv6_route.o /sys/fs/bpf/my_route

And then print out the results using the following command:

::

  $ cat /sys/fs/bpf/my_route


-------------------------------------------------------
Implement Kernel Support for BPF Iterator Program Types
-------------------------------------------------------

To implement a BPF iterator in the kernel, the developer must make a one-time
change to the following key data structure defined in the `bpf.h
<https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/include/linux/bpf.h>`_
file.

::

  struct bpf_iter_reg {
      const char *target;
      bpf_iter_attach_target_t attach_target;
      bpf_iter_detach_target_t detach_target;
      bpf_iter_show_fdinfo_t show_fdinfo;
      bpf_iter_fill_link_info_t fill_link_info;
      bpf_iter_get_func_proto_t get_func_proto;
      u32 ctx_arg_info_size;
      u32 feature;
      struct bpf_ctx_arg_aux ctx_arg_info[BPF_ITER_CTX_ARG_MAX];
      const struct bpf_iter_seq_info *seq_info;
  };

After filling the data structure fields, call ``bpf_iter_reg_target()`` to
register the iterator to the main BPF iterator subsystem.

The following is the breakdown for each field in struct ``bpf_iter_reg``.

.. list-table::
   :widths: 25 50
   :header-rows: 1

   * - Fields
     - Description
   * - target
     - Specifies the name of the BPF iterator. For example: ``bpf_map``,
       ``bpf_map_elem``. The name should be different from other ``bpf_iter`` target names in the kernel.
   * - attach_target and detach_target
     - Allows for target specific ``link_create`` action since some targets
       may need special processing. Called during the user space link_create stage.
   * - show_fdinfo and fill_link_info
     - Called to fill target specific information when user tries to get link
       info associated with the iterator.
   * - get_func_proto
     - Permits a BPF iterator to access BPF helpers specific to the iterator.
   * - ctx_arg_info_size and ctx_arg_info
     - Specifies the verifier states for BPF program arguments associated with
       the BPF iterator.
   * - feature
     - Specifies certain action requests in the kernel BPF iterator
       infrastructure. Currently, only BPF_ITER_RESCHED is supported. This means
       that the kernel function ``cond_resched()`` is called to keep other
       kernel subsystems (e.g., RCU) from misbehaving.
   * - seq_info
     - Specifies the set of seq operations for the BPF iterator and helpers to
       initialize/free the private data for the corresponding ``seq_file``.

`Click here
<https://lore.kernel.org/bpf/20210212183107.50963-2-songliubraving@fb.com/>`_
to see an implementation of the ``task_vma`` BPF iterator in the kernel.

---------------------------------
Parameterizing BPF Task Iterators
---------------------------------

By default, BPF iterators walk through all the objects of the specified types
(processes, cgroups, maps, etc.) across the entire system to read relevant
kernel data. But often, there are cases where we only care about a much smaller
subset of iterable kernel objects, such as only iterating tasks within a
specific process. Therefore, BPF iterator programs support filtering out objects
from iteration by allowing user space to configure the iterator program when it
is attached.

--------------------------
BPF Task Iterator Program
--------------------------

The following code is a BPF iterator program to print files and task information
through the ``seq_file`` of the iterator. It is a standard BPF iterator program
that is invoked for every file the iterator visits. We will use this BPF program
in our example later.

::

  #include <vmlinux.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>    /* BPF_SEQ_PRINTF() */

  char _license[] SEC("license") = "GPL";

  SEC("iter/task_file")
  int dump_task_file(struct bpf_iter__task_file *ctx)
  {
      struct seq_file *seq = ctx->meta->seq;
      struct task_struct *task = ctx->task;
      struct file *file = ctx->file;
      __u32 fd = ctx->fd;

      if (task == NULL || file == NULL)
          return 0;

      /* Print the header once, on the first invocation. */
      if (ctx->meta->seq_num == 0) {
          BPF_SEQ_PRINTF(seq, "    tgid      pid       fd      file\n");
      }

      BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
                     (long)file->f_op);
      return 0;
  }

----------------------------------------
Creating a File Iterator with Parameters
----------------------------------------

Now, let us look at how to create an iterator that includes only files of a
process.

First, fill the ``bpf_iter_attach_opts`` struct as shown below:

::

  LIBBPF_OPTS(bpf_iter_attach_opts, opts);
  union bpf_iter_link_info linfo;
  memset(&linfo, 0, sizeof(linfo));
  linfo.task.pid = getpid();
  opts.link_info = &linfo;
  opts.link_info_len = sizeof(linfo);

``linfo.task.pid``, if it is non-zero, directs the kernel to create an iterator
that only includes opened files for the process with the specified ``pid``. In
this example, we will only be iterating files for our process. If
``linfo.task.pid`` is zero, the iterator will visit every opened file of every
process. Similarly, ``linfo.task.tid`` directs the kernel to create an iterator
that visits opened files of a specific thread, not a process. Using
``linfo.task.tid`` gives a different result from using ``linfo.task.pid`` only
if the thread has a separate file descriptor table. In most circumstances, all
process threads share a single file descriptor table.

Now, in the userspace program, pass a pointer to the struct to
``bpf_program__attach_iter()``.

::

  link = bpf_program__attach_iter(prog, &opts);
  iter_fd = bpf_iter_create(bpf_link__fd(link));

If both *tid* and *pid* are zero, an iterator created from this struct
``bpf_iter_attach_opts`` will include every opened file of every task in the
system (in the current *pid* namespace, actually). It is the same as passing
NULL as the second argument to ``bpf_program__attach_iter()``.

The whole program looks like the following code:

::

  #include <stdio.h>
  #include <string.h>    /* memset() */
  #include <unistd.h>
  #include <bpf/bpf.h>
  #include <bpf/libbpf.h>
  #include "bpf_iter_task_ex.skel.h"

  static int do_read_opts(struct bpf_program *prog, struct bpf_iter_attach_opts *opts)
  {
      struct bpf_link *link;
      char buf[16] = {};
      int iter_fd = -1, len;
      int ret = 0;

      link = bpf_program__attach_iter(prog, opts);
      if (!link) {
          fprintf(stderr, "bpf_program__attach_iter() fails\n");
          return -1;
      }
      iter_fd = bpf_iter_create(bpf_link__fd(link));
      if (iter_fd < 0) {
          fprintf(stderr, "bpf_iter_create() fails\n");
          ret = -1;
          goto free_link;
      }
      /* Do not check contents; just print until read() returns no more data */
      while ((len = read(iter_fd, buf, sizeof(buf) - 1)) > 0) {
          buf[len] = 0;
          printf("%s", buf);
      }
      printf("\n");
  free_link:
      if (iter_fd >= 0)
          close(iter_fd);
      bpf_link__destroy(link);
      return ret;    /* propagate the bpf_iter_create() failure, if any */
  }

  static void test_task_file(void)
  {
      LIBBPF_OPTS(bpf_iter_attach_opts, opts);
      struct bpf_iter_task_ex *skel;
      union bpf_iter_link_info linfo;

      skel = bpf_iter_task_ex__open_and_load();
      if (skel == NULL)
          return;
      /* Restrict the iterator to the files of the current process. */
      memset(&linfo, 0, sizeof(linfo));
      linfo.task.pid = getpid();
      opts.link_info = &linfo;
      opts.link_info_len = sizeof(linfo);
      printf("PID %d\n", getpid());
      do_read_opts(skel->progs.dump_task_file, &opts);
      bpf_iter_task_ex__destroy(skel);
  }

  int main(int argc, const char * const * argv)
  {
      test_task_file();
      return 0;
  }

The following lines are the output of the program.
::

  PID 1859

      tgid      pid       fd      file
      1859     1859        0 ffffffff82270aa0
      1859     1859        1 ffffffff82270aa0
      1859     1859        2 ffffffff82270aa0
      1859     1859        3 ffffffff82272980
      1859     1859        4 ffffffff8225e120
      1859     1859        5 ffffffff82255120
      1859     1859        6 ffffffff82254f00
      1859     1859        7 ffffffff82254d80
      1859     1859        8 ffffffff8225abe0

------------------
Without Parameters
------------------

Let us look at how a BPF iterator without parameters skips files of other
processes in the system. In this case, the BPF program has to check the pid or
the tid of tasks, or it will receive every opened file in the system (in the
current *pid* namespace, actually). So, we usually add a global variable in the
BPF program to pass a *pid* to the BPF program.

The BPF program would look like the following block.

::

  ......
  int target_pid = 0;

  SEC("iter/task_file")
  int dump_task_file(struct bpf_iter__task_file *ctx)
  {
      ......
      if (task->tgid != target_pid) /* Check task->pid instead to check thread IDs */
          return 0;
      BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
                     (long)file->f_op);
      return 0;
  }

The user space program would look like the following block:

::

  ......
  static void test_task_file(void)
  {
      ......
      skel = bpf_iter_task_ex__open_and_load();
      if (skel == NULL)
          return;
      skel->bss->target_pid = getpid(); /* process ID. For thread id, use gettid() */
      memset(&linfo, 0, sizeof(linfo));
      linfo.task.pid = getpid();
      opts.link_info = &linfo;
      opts.link_info_len = sizeof(linfo);
      ......
  }

``target_pid`` is a global variable in the BPF program. The user space program
should initialize the variable with a process ID to skip opened files of other
processes in the BPF program.
When you parametrize a BPF iterator, the iterator
calls the BPF program fewer times, which can save significant resources.

---------------------------
Parametrizing VMA Iterators
---------------------------

By default, a BPF VMA iterator includes every VMA in every process. However,
you can still specify a process or a thread to include only its VMAs. Unlike
files, a thread cannot have a separate address space (since Linux 2.6.0-test6).
Here, using *tid* makes no difference from using *pid*.

----------------------------
Parametrizing Task Iterators
----------------------------

A BPF task iterator with *pid* includes all tasks (threads) of a process. The
BPF program receives these tasks one after another. You can specify a BPF task
iterator with *tid* parameter to include only the tasks that match the given
*tid*.
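
The difference between the *pid* and *tid* parameters can be illustrated with a
small user-space model. This is only a sketch of the selection rule described
above, not kernel code: ``struct task_ids``, ``task_selected()`` and
``count_selected()`` are hypothetical names, and the model assumes that at most
one of the two parameters is non-zero.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two task_struct fields used in this document. */
struct task_ids {
    int tgid;   /* thread group ID: the process ID seen from user space */
    int pid;    /* per-task ID: the thread ID seen from user space */
};

/*
 * Model of the selection rule: a non-zero want_pid keeps every thread of that
 * process, a non-zero want_tid keeps only the matching task, and zero for
 * both keeps everything. Assumes at most one of the two is non-zero.
 */
static int task_selected(const struct task_ids *t, int want_pid, int want_tid)
{
    assert(want_pid == 0 || want_tid == 0);

    if (want_tid != 0)
        return t->pid == want_tid;
    if (want_pid != 0)
        return t->tgid == want_pid;
    return 1;
}

/* Counts how many tasks the BPF program would be invoked for in this model. */
static size_t count_selected(const struct task_ids *tasks, size_t n,
                             int want_pid, int want_tid)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++)
        count += task_selected(&tasks[i], want_pid, want_tid) ? 1 : 0;
    return count;
}
```

With three threads in process 100 and one in process 200, the model selects
three tasks for *pid* 100 and a single task for *tid* 101, which mirrors the
``task->tgid`` versus ``task->pid`` checks used in the BPF program earlier.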