=============
BPF Iterators
=============

--------
Overview
--------

BPF supports two separate entities collectively known as "BPF iterators": the
BPF iterator *program type* and *open-coded* BPF iterators. The former is
a stand-alone BPF program type which, when attached and activated by the user,
will be called once for each entity (task_struct, cgroup, etc.) that is being
iterated. The latter is a set of BPF-side APIs implementing iterator
functionality and available across multiple BPF program types. Open-coded
iterators provide similar functionality to BPF iterator programs, but give
more flexibility and control to all other BPF program types. BPF iterator
programs, on the other hand, can be used to implement anonymous or BPF
FS-mounted special files, whose contents are generated by the attached BPF
iterator program, backed by seq_file functionality. Both are useful depending
on specific needs.

When adding a new BPF iterator program, it is expected that similar
functionality will be added as an open-coded iterator for maximum flexibility.
It's also expected that iteration logic and code will be maximally shared and
reused between the two iterator API surfaces.

------------------------
Open-coded BPF Iterators
------------------------

Open-coded BPF iterators are implemented as tightly-coupled trios of kfuncs
(constructor, next element fetch, destructor) and an iterator-specific type
describing on-the-stack iterator state, which is guaranteed by the BPF
verifier to not be tampered with outside of the corresponding
constructor/destructor/next APIs.

Each kind of open-coded BPF iterator has its own associated
struct bpf_iter_<type>, where <type> denotes a specific type of iterator.
bpf_iter_<type> state needs to live on the BPF program stack, so make sure it's
small enough to fit on the BPF stack. For performance reasons it's best to avoid
dynamic memory allocation for iterator state and size the state struct big
enough to fit everything necessary. But if necessary, dynamic memory
allocation is a way to bypass BPF stack limitations. Note that the state struct
size is part of the iterator's user-visible API, and changing it will break
backwards compatibility, so be deliberate about designing it.

All kfuncs (constructor, next, destructor) have to be named consistently as
bpf_iter_<type>_{new,next,destroy}(), respectively. <type> represents the
iterator type, and iterator state should be represented as a matching
`struct bpf_iter_<type>` state type. Also, all iter kfuncs should have
a pointer to this `struct bpf_iter_<type>` as the very first argument.

Additionally:
  - Constructor, i.e., `bpf_iter_<type>_new()`, can have an arbitrary number
    of extra arguments. Return type is not enforced either.
  - Next method, i.e., `bpf_iter_<type>_next()`, has to return a pointer
    type and should have exactly one argument: `struct bpf_iter_<type> *`
    (const/volatile/restrict and typedefs are ignored).
  - Destructor, i.e., `bpf_iter_<type>_destroy()`, should return void and
    should have exactly one argument, similar to the next method.
  - `struct bpf_iter_<type>` size is enforced to be positive and
    a multiple of 8 bytes (to fit stack slots correctly).

Such strictness and consistency allow building generic helpers that abstract
away important, but boilerplate, details, so that open-coded iterators can be
used effectively and ergonomically (see libbpf's bpf_for_each() macro). This is
enforced at kfunc registration point by the kernel.

Constructor/next/destructor implementation contract is as follows:
  - constructor, `bpf_iter_<type>_new()`, always initializes iterator state on
    the stack. If any of the input arguments are invalid, constructor should
    make sure to still initialize it such that subsequent next() calls will
    return NULL. I.e., on error, *return error and construct empty iterator*.
    Constructor kfunc is marked with KF_ITER_NEW flag.

  - next method, `bpf_iter_<type>_next()`, accepts pointer to iterator state
    and produces an element. Next method should always return a pointer. The
    contract with the BPF verifier is that next method *guarantees* that it
    will eventually return NULL when elements are exhausted. Once NULL is
    returned, subsequent next calls *should keep returning NULL*. Next method
    is marked with KF_ITER_NEXT (and should also have KF_RET_NULL as
    NULL-returning kfunc, of course).

  - destructor, `bpf_iter_<type>_destroy()`, is always called once, even if
    the constructor failed or next returned nothing. Destructor frees up any
    resources and marks stack space used by `struct bpf_iter_<type>` as usable
    for something else. Destructor is marked with KF_ITER_DESTROY flag.

Any open-coded BPF iterator implementation has to implement at least these
three methods. It is enforced that for any given type of iterator only
applicable constructor/destructor/next are callable. I.e., verifier ensures
you can't pass number iterator state into, say, cgroup iterator's next method.
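
The constructor/next/destructor contract can be modelled in plain user-space C.
The sketch below is not kernel code: ``struct demo_iter_num`` and the
``demo_iter_num_*()`` functions are hypothetical names that only mirror the
shape of a numeric range iterator. It illustrates three points from the
contract: the constructor always leaves a usable (possibly empty) state, next
keeps returning NULL once exhausted, and the state size is a multiple of
8 bytes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model; this is NOT a kernel kfunc implementation. */
struct demo_iter_num {
	int64_t next;	/* value the following next() call will yield */
	int64_t end;	/* exclusive upper bound */
	int64_t val;	/* storage for the element handed to the caller */
};

/* Mirrors the rule that the state size is positive and a multiple of 8. */
_Static_assert(sizeof(struct demo_iter_num) > 0 &&
	       sizeof(struct demo_iter_num) % 8 == 0,
	       "state size must be a positive multiple of 8 bytes");

/* Constructor: always initializes the state. On invalid input it returns
 * an error AND leaves an empty iterator, so next() yields NULL at once. */
static int demo_iter_num_new(struct demo_iter_num *it, int start, int end)
{
	assert(it != NULL);
	it->val = 0;
	if (start > end) {
		it->next = 0;
		it->end = 0;
		return -EINVAL;
	}
	it->next = start;
	it->end = end;
	return 0;
}

/* Next: returns a pointer to the element, or NULL when exhausted.
 * Once NULL was returned, every later call returns NULL as well. */
static int64_t *demo_iter_num_next(struct demo_iter_num *it)
{
	if (it->next >= it->end) {
		it->next = 0;
		it->end = 0;
		return NULL;
	}
	it->val = it->next++;
	return &it->val;
}

/* Destructor: called exactly once, even after a failed constructor. */
static void demo_iter_num_destroy(struct demo_iter_num *it)
{
	it->next = 0;
	it->end = 0;
}

/* Typical caller: construct, loop on next until NULL, always destroy. */
static long long demo_sum_range(int start, int end)
{
	struct demo_iter_num it;
	long long sum = 0;
	int64_t *v;

	demo_iter_num_new(&it, start, end);	/* on error: empty iterator */
	while ((v = demo_iter_num_next(&it)))
		sum += *v;
	demo_iter_num_destroy(&it);
	return sum;
}
```

Because the constructor never leaves the state uninitialized, the caller can
ignore its return value and still run the loop and the destructor safely. That
property is what lets a generic helper such as libbpf's bpf_for_each() hide
the three calls behind a single loop construct.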

From a 10,000-foot BPF verification point of view, next methods are the points
of forking a verification state, which are conceptually similar to what the
verifier is doing when validating conditional jumps. The verifier branches at
each `call bpf_iter_<type>_next` instruction and simulates two outcomes: NULL
(iteration is done) and non-NULL (new element is returned). NULL is simulated
first and is supposed to reach exit without looping. After that the non-NULL
case is validated and it either reaches exit (for trivial examples with no real
loop), or reaches another `call bpf_iter_<type>_next` instruction with the
state equivalent to already (partially) validated one. State equivalency at
that point means we technically are going to be looping forever without
"breaking out" of the established "state envelope" (i.e., subsequent
iterations don't add any new knowledge or constraints to the verifier state,
so running 1, 2, 10, or a million of them doesn't matter). But taking into
account the contract stating that iterator next method *has to* return NULL
eventually, we can conclude that loop body is safe and will eventually
terminate. Given we validated logic outside of the loop (NULL case), and
concluded that loop body is safe (though potentially looping many times),
verifier can claim safety of the overall program logic.

------------------------
BPF Iterators Motivation
------------------------

There are a few existing ways to dump kernel data into user space. The most
popular one is the ``/proc`` system. For example, ``cat /proc/net/tcp6`` dumps
all tcp6 sockets in the system, and ``cat /proc/net/netlink`` dumps all netlink
sockets in the system. However, their output format tends to be fixed, and if
users want more information about these sockets, they have to patch the kernel,
which often takes time to publish upstream and release. The same is true for popular
tools like `ss <https://man7.org/linux/man-pages/man8/ss.8.html>`_ where any
additional information needs a kernel patch.

To solve this problem, the `drgn
<https://www.kernel.org/doc/html/latest/bpf/drgn.html>`_ tool is often used to
dig out the kernel data with no kernel change. However, the main drawback for
drgn is performance, as it cannot do pointer tracing inside the kernel. In
addition, drgn cannot validate a pointer value and may read invalid data if the
pointer becomes invalid inside the kernel.

The BPF iterator solves the above problem by providing flexibility on what data
(e.g., tasks, bpf_maps, etc.) to collect by calling BPF programs for each kernel
data object.

----------------------
How BPF Iterators Work
----------------------

A BPF iterator is a type of BPF program that allows users to iterate over
specific types of kernel objects. Unlike traditional BPF tracing programs that
allow users to define callbacks that are invoked at particular points of
execution in the kernel, BPF iterators allow users to define callbacks that
should be executed for every entry in a variety of kernel data structures.

For example, users can define a BPF iterator that iterates over every task on
the system and dumps the total amount of CPU runtime currently used by each of
them. Another BPF task iterator may instead dump the cgroup information for each
task. Such flexibility is the core value of BPF iterators.

A BPF program is always loaded into the kernel at the behest of a user space
process. A user space process loads a BPF program by opening and initializing
the program skeleton as required and then invoking a syscall to have the BPF
program verified and loaded by the kernel.

In traditional tracing programs, a program is activated by having user space
obtain a ``bpf_link`` to the program with ``bpf_program__attach()``. Once
activated, the program callback will be invoked whenever the tracepoint is
triggered in the main kernel. For BPF iterator programs, a ``bpf_link`` to the
program is obtained using ``bpf_link_create()``, and the program callback is
invoked by issuing system calls from user space.

Next, let us see how you can use the iterators to iterate over kernel objects
and read data.

------------------------
How to Use BPF iterators
------------------------

BPF selftests are a great resource to illustrate how to use the iterators. In
this section, we’ll walk through a BPF selftest which shows how to load and use
a BPF iterator program. To begin, we’ll look at `bpf_iter.c
<https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/prog_tests/bpf_iter.c>`_,
which illustrates how to load and trigger BPF iterators on the user space side.
Later, we’ll look at a BPF program that runs in kernel space.

Loading a BPF iterator in the kernel from user space typically involves the
following steps:

* The BPF program is loaded into the kernel through ``libbpf``. Once the kernel
  has verified and loaded the program, it returns a file descriptor (fd) to user
  space.
* Obtain a ``link_fd`` to the BPF program by calling ``bpf_link_create()``
  with the BPF program file descriptor received from the kernel.
* Next, obtain a BPF iterator file descriptor (``bpf_iter_fd``) by calling
  ``bpf_iter_create()`` with the ``link_fd`` obtained in the previous step.
* Trigger the iteration by calling ``read(bpf_iter_fd)`` until no data is
  available.
* Close the iterator fd using ``close(bpf_iter_fd)``.
* To reread the data, get a new ``bpf_iter_fd`` and do the read again.
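
The "read until no data is available" step can be sketched as a small helper.
``drain_iter_fd()`` is a hypothetical name, not a libbpf function. It works on
any readable file descriptor, so it can be tried with a pipe; with a real
iterator the descriptor would be the one returned by ``bpf_iter_create()``.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: read fd until end of data, optionally copying the
 * bytes to out (pass NULL to discard them).
 * Returns the total number of bytes read, or -1 on error. */
static long drain_iter_fd(int fd, FILE *out)
{
	char buf[4096];
	long total = 0;
	ssize_t n;

	assert(fd >= 0);
	for (;;) {
		n = read(fd, buf, sizeof(buf));
		if (n == 0)		/* no more data: iteration is finished */
			break;
		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted by a signal; retry */
			return -1;
		}
		if (out && fwrite(buf, 1, (size_t)n, out) != (size_t)n)
			return -1;
		total += n;
	}
	return total;
}
```

A caller would typically pass ``stdout`` as ``out``, close the descriptor
afterwards, and create a fresh iterator fd if the data has to be read again.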

The following are a few examples of selftest BPF iterator programs:

* `bpf_iter_tcp4.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_tcp4.c>`_
* `bpf_iter_task_vmas.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_vmas.c>`_
* `bpf_iter_task_file.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_file.c>`_

Let us look at ``bpf_iter_task_file.c``, which runs in kernel space:

Here is the definition of ``bpf_iter__task_file`` in `vmlinux.h
<https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html#btf>`_.
Any struct name in ``vmlinux.h`` in the format ``bpf_iter__<iter_name>``
represents a BPF iterator. The suffix ``<iter_name>`` represents the type of
iterator.

::

    struct bpf_iter__task_file {
            union {
                struct bpf_iter_meta *meta;
            };
            union {
                struct task_struct *task;
            };
            u32 fd;
            union {
                struct file *file;
            };
    };

In the above code, the field 'meta' contains the metadata, which is the same for
all BPF iterator programs. The rest of the fields are specific to different
iterators. For example, for task_file iterators, the kernel layer provides the
'task', 'fd' and 'file' field values. The 'task' and 'file' are `reference
counted
<https://facebookmicrosites.github.io/bpf/blog/2018/08/31/object-lifetime.html#file-descriptors-and-reference-counters>`_,
so they won't go away when the BPF program runs.

Here is a snippet from the ``bpf_iter_task_file.c`` file:

::

  /* count, tgid, last_tgid and unique_tgid_count are global variables
   * defined elsewhere in the file.
   */
  SEC("iter/task_file")
  int dump_task_file(struct bpf_iter__task_file *ctx)
  {
    struct seq_file *seq = ctx->meta->seq;
    struct task_struct *task = ctx->task;
    struct file *file = ctx->file;
    __u32 fd = ctx->fd;

    if (task == NULL || file == NULL)
      return 0;

    if (ctx->meta->seq_num == 0) {
      count = 0;
      BPF_SEQ_PRINTF(seq, "    tgid      pid       fd      file\n");
    }

    if (tgid == task->tgid && task->tgid != task->pid)
      count++;

    if (last_tgid != task->tgid) {
      last_tgid = task->tgid;
      unique_tgid_count++;
    }

    BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
            (long)file->f_op);
    return 0;
  }

In the above example, the section name ``SEC("iter/task_file")`` indicates that
the program is a BPF iterator program that iterates over all files from all
tasks. The context of the program is the ``bpf_iter__task_file`` struct.

The user space program invokes the BPF iterator program running in the kernel
by issuing a ``read()`` syscall. Once invoked, the BPF
program can export data to user space using a variety of BPF helper functions.
You can use either ``bpf_seq_printf()`` (and the BPF_SEQ_PRINTF helper macro) or
the ``bpf_seq_write()`` function based on whether you need formatted output or
just binary data, respectively. For binary-encoded data, the user space
applications can process the data from ``bpf_seq_write()`` as needed. For the
formatted data, you can use ``cat <path>`` to print the results similar to ``cat
/proc/net/netlink`` after pinning the BPF iterator to the bpffs mount. Later,
use ``rm -f <path>`` to remove the pinned iterator.

For example, you can use the following command to create a BPF iterator from the
``bpf_iter_ipv6_route.o`` object file and pin it to the ``/sys/fs/bpf/my_route``
path:

::

  $ bpftool iter pin ./bpf_iter_ipv6_route.o /sys/fs/bpf/my_route

And then print out the results using the following command:

::

  $ cat /sys/fs/bpf/my_route
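
For the binary path through ``bpf_seq_write()``, user space has to agree with
the BPF program on a record layout. The sketch below assumes a hypothetical
fixed-size record, ``struct demo_task_file_rec``, that a BPF program would emit
once per visited object; the struct, ``demo_decode_records()`` and the callback
are illustrative names only. Because a ``read()`` may end in the middle of a
record, the decoder reports how many bytes it consumed so that the caller can
carry the remaining tail over to the next read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record layout shared by the BPF program and user space. */
struct demo_task_file_rec {
	uint32_t tgid;
	uint32_t pid;
	uint32_t fd;
};

typedef void (*demo_rec_cb)(const struct demo_task_file_rec *rec, void *arg);

/* Decode as many whole records as buf holds and pass each one to cb.
 * Returns the number of bytes consumed; the unconsumed tail is a partial
 * record that the caller prepends to the data from its next read(). */
static size_t demo_decode_records(const unsigned char *buf, size_t len,
				  demo_rec_cb cb, void *arg)
{
	struct demo_task_file_rec rec;
	size_t off = 0;

	assert(buf != NULL && cb != NULL);
	while (len - off >= sizeof(rec)) {
		/* memcpy avoids unaligned access into the raw buffer */
		memcpy(&rec, buf + off, sizeof(rec));
		cb(&rec, arg);
		off += sizeof(rec);
	}
	return off;
}

/* Example callback: count the records that belong to one process. */
struct demo_count {
	uint32_t tgid;
	unsigned int hits;
};

static void demo_count_cb(const struct demo_task_file_rec *rec, void *arg)
{
	struct demo_count *c = arg;

	if (rec->tgid == c->tgid)
		c->hits++;
}
```

A fixed-size record keeps the decoder trivial. A variable-length format would
need a length prefix or a delimiter, which is a design choice left to the
program author.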


-------------------------------------------------------
Implement Kernel Support for BPF Iterator Program Types
-------------------------------------------------------

To implement a BPF iterator in the kernel, the developer must fill in an
instance of the following key data structure defined in the `bpf.h
<https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/include/linux/bpf.h>`_
file.

::

  struct bpf_iter_reg {
          const char *target;
          bpf_iter_attach_target_t attach_target;
          bpf_iter_detach_target_t detach_target;
          bpf_iter_show_fdinfo_t show_fdinfo;
          bpf_iter_fill_link_info_t fill_link_info;
          bpf_iter_get_func_proto_t get_func_proto;
          u32 ctx_arg_info_size;
          u32 feature;
          struct bpf_ctx_arg_aux ctx_arg_info[BPF_ITER_CTX_ARG_MAX];
          const struct bpf_iter_seq_info *seq_info;
  };

After filling the data structure fields, call ``bpf_iter_reg_target()`` to
register the iterator to the main BPF iterator subsystem.

The following is the breakdown for each field in struct ``bpf_iter_reg``.

.. list-table::
   :widths: 25 50
   :header-rows: 1

   * - Fields
     - Description
   * - target
     - Specifies the name of the BPF iterator. For example: ``bpf_map``,
       ``bpf_map_elem``. The name should be different from other ``bpf_iter`` target names in the kernel.
   * - attach_target and detach_target
     - Allows for target specific ``link_create`` action since some targets
       may need special processing. Called during the user space link_create stage.
   * - show_fdinfo and fill_link_info
     - Called to fill target specific information when the user tries to get
       link info associated with the iterator.
   * - get_func_proto
     - Permits a BPF iterator to access BPF helpers specific to the iterator.
   * - ctx_arg_info_size and ctx_arg_info
     - Specifies the verifier states for BPF program arguments associated with
       the bpf iterator.
   * - feature
     - Specifies certain action requests in the kernel BPF iterator
       infrastructure. Currently, only BPF_ITER_RESCHED is supported. This means
       that the kernel function cond_resched() is called to keep other kernel
       subsystems (e.g., RCU) from misbehaving.
   * - seq_info
     - Specifies the set of seq operations for the BPF iterator and helpers to
       initialize/free the private data for the corresponding ``seq_file``.

`Click here
<https://lore.kernel.org/bpf/20210212183107.50963-2-songliubraving@fb.com/>`_
to see an implementation of the ``task_vma`` BPF iterator in the kernel.

---------------------------------
Parameterizing BPF Task Iterators
---------------------------------

By default, BPF iterators walk through all the objects of the specified types
(processes, cgroups, maps, etc.) across the entire system to read relevant
kernel data. But often, there are cases where we only care about a much smaller
subset of iterable kernel objects, such as only iterating tasks within a
specific process. Therefore, BPF iterator programs support filtering out objects
from iteration by allowing user space to configure the iterator program when it
is attached.

--------------------------
BPF Task Iterator Program
--------------------------

The following code is a BPF iterator program to print files and task information
through the ``seq_file`` of the iterator. It is a standard BPF iterator program
that visits every file produced by the iterator. We will use this BPF program
in our example later.

::

  #include <vmlinux.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h> /* BPF_SEQ_PRINTF() */

  char _license[] SEC("license") = "GPL";

  SEC("iter/task_file")
  int dump_task_file(struct bpf_iter__task_file *ctx)
  {
        struct seq_file *seq = ctx->meta->seq;
        struct task_struct *task = ctx->task;
        struct file *file = ctx->file;
        __u32 fd = ctx->fd;

        /* task and file are NULL on the final call that ends the iteration */
        if (task == NULL || file == NULL)
                return 0;
        /* print the header once, before the first row */
        if (ctx->meta->seq_num == 0) {
                BPF_SEQ_PRINTF(seq, "    tgid      pid       fd      file\n");
        }
        BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
                        (long)file->f_op);
        return 0;
  }

----------------------------------------
Creating a File Iterator with Parameters
----------------------------------------

Now, let us look at how to create an iterator that includes only files of a
process.

First, fill the ``bpf_iter_attach_opts`` struct as shown below:

::

  LIBBPF_OPTS(bpf_iter_attach_opts, opts);
  union bpf_iter_link_info linfo;
  memset(&linfo, 0, sizeof(linfo));
  linfo.task.pid = getpid();
  opts.link_info = &linfo;
  opts.link_info_len = sizeof(linfo);

``linfo.task.pid``, if it is non-zero, directs the kernel to create an iterator
that only includes opened files for the process with the specified ``pid``. In
this example, we will only be iterating files for our process. If
``linfo.task.pid`` is zero, the iterator will visit every opened file of every
process. Similarly, ``linfo.task.tid`` directs the kernel to create an iterator
that visits opened files of a specific thread, not a process. For file
iteration, setting ``linfo.task.tid`` gives a different result from setting
``linfo.task.pid`` only if the thread has a separate file descriptor table. In
most circumstances, all process threads share a single file descriptor table.
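
The selection rules described above can be written down as a small user-space
model. This is a sketch of the semantics only, not the kernel implementation:
``struct demo_task_filter`` stands in for the two ``linfo.task`` fields
discussed here, and all ``demo_*`` names are hypothetical. Treating "both
*tid* and *pid* set" as invalid is an assumption made by this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the two parameters discussed in the text. */
struct demo_task_filter {
	uint32_t tid;	/* non-zero: select a single thread */
	uint32_t pid;	/* non-zero: select every thread of one process */
};

/* A task as the kernel identifies it: tgid is the process ID and
 * pid is the thread ID. */
struct demo_task {
	uint32_t tgid;
	uint32_t pid;
};

/* Assumption of this sketch: setting both fields at once is invalid. */
static bool demo_filter_valid(const struct demo_task_filter *f)
{
	assert(f != NULL);
	return !(f->tid && f->pid);
}

/* Returns true if the iterator would visit task t under filter f. */
static bool demo_task_selected(const struct demo_task_filter *f,
			       const struct demo_task *t)
{
	assert(f != NULL && t != NULL);
	if (f->tid)
		return t->pid == f->tid;	/* one specific thread */
	if (f->pid)
		return t->tgid == f->pid;	/* all threads of a process */
	return true;				/* no parameters: every task */
}
```

With a filter of ``{ .pid = 100 }`` both the main thread (tgid 100, pid 100)
and a worker thread (tgid 100, pid 101) are selected, while ``{ .tid = 101 }``
selects only the worker.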

Now, in the user space program, pass a pointer to the ``opts`` struct to
``bpf_program__attach_iter()``.

::

  link = bpf_program__attach_iter(prog, &opts);
  iter_fd = bpf_iter_create(bpf_link__fd(link));

If both *tid* and *pid* are zero, an iterator created from this struct
``bpf_iter_attach_opts`` will include every opened file of every task in the
system (in the current *pid* namespace, actually). It is the same as passing
NULL as the second argument to ``bpf_program__attach_iter()``.

The whole program looks like the following code:

::

  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <bpf/bpf.h>
  #include <bpf/libbpf.h>
  #include "bpf_iter_task_ex.skel.h"

  static int do_read_opts(struct bpf_program *prog, struct bpf_iter_attach_opts *opts)
  {
        struct bpf_link *link;
        char buf[16] = {};
        int iter_fd = -1, len;
        int ret = 0;

        link = bpf_program__attach_iter(prog, opts);
        if (!link) {
                fprintf(stderr, "bpf_program__attach_iter() fails\n");
                return -1;
        }
        iter_fd = bpf_iter_create(bpf_link__fd(link));
        if (iter_fd < 0) {
                fprintf(stderr, "bpf_iter_create() fails\n");
                ret = -1;
                goto free_link;
        }
        /* print the contents as they arrive and ensure read() ends without error */
        while ((len = read(iter_fd, buf, sizeof(buf) - 1)) > 0) {
                buf[len] = 0;
                printf("%s", buf);
        }
        if (len < 0) {
                fprintf(stderr, "read() fails\n");
                ret = -1;
        }
        printf("\n");
  free_link:
        if (iter_fd >= 0)
                close(iter_fd);
        bpf_link__destroy(link);
        return ret;
  }

  static void test_task_file(void)
  {
        LIBBPF_OPTS(bpf_iter_attach_opts, opts);
        struct bpf_iter_task_ex *skel;
        union bpf_iter_link_info linfo;
        skel = bpf_iter_task_ex__open_and_load();
        if (skel == NULL)
                return;
        memset(&linfo, 0, sizeof(linfo));
        linfo.task.pid = getpid();
        opts.link_info = &linfo;
        opts.link_info_len = sizeof(linfo);
        printf("PID %d\n", getpid());
        do_read_opts(skel->progs.dump_task_file, &opts);
        bpf_iter_task_ex__destroy(skel);
  }

  int main(int argc, const char * const * argv)
  {
        test_task_file();
        return 0;
  }

The following lines are the output of the program:

::

  PID 1859

     tgid      pid       fd      file
     1859     1859        0 ffffffff82270aa0
     1859     1859        1 ffffffff82270aa0
     1859     1859        2 ffffffff82270aa0
     1859     1859        3 ffffffff82272980
     1859     1859        4 ffffffff8225e120
     1859     1859        5 ffffffff82255120
     1859     1859        6 ffffffff82254f00
     1859     1859        7 ffffffff82254d80
     1859     1859        8 ffffffff8225abe0
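
A consumer of this text output can parse each data row with ``sscanf()``. The
sketch below assumes the row layout produced by the ``"%8d %8d %8d %lx\n"``
format string of the BPF program above; ``struct demo_row`` and
``demo_parse_row()`` are hypothetical names. Lines that are not data rows, such
as the header or an empty line, are rejected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* One parsed row of the iterator's text output (hypothetical type). */
struct demo_row {
	int tgid;
	int pid;
	int fd;
	unsigned long long f_op;	/* address printed in hexadecimal */
};

/* Returns true if line is a data row and fills *row. The header line,
 * the "PID ..." line and blank lines make sscanf() match fewer than
 * four fields, so they return false. */
static bool demo_parse_row(const char *line, struct demo_row *row)
{
	assert(line != NULL && row != NULL);
	return sscanf(line, "%d %d %d %llx",
		      &row->tgid, &row->pid, &row->fd, &row->f_op) == 4;
}
```

Text output is convenient for ``cat`` and quick inspection. If a program is the
only consumer, emitting binary records with ``bpf_seq_write()`` avoids the
parsing step altogether.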

------------------
Without Parameters
------------------

Let us look at how a BPF iterator without parameters skips files of other
processes in the system. In this case, the BPF program has to check the pid or
the tid of tasks, or it will receive every opened file in the system (in the
current *pid* namespace, actually). So, we usually add a global variable in the
BPF program to pass a *pid* to the BPF program.

The BPF program would look like the following block.

  ::

    ......
    int target_pid = 0;

    SEC("iter/task_file")
    int dump_task_file(struct bpf_iter__task_file *ctx)
    {
          ......
          if (task->tgid != target_pid) /* Check task->pid instead to check thread IDs */
                  return 0;
          BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
                          (long)file->f_op);
          return 0;
    }

The user space program would look like the following block:

  ::

    ......
    static void test_task_file(void)
    {
          ......
          skel = bpf_iter_task_ex__open_and_load();
          if (skel == NULL)
                  return;
          skel->bss->target_pid = getpid(); /* process ID.  For thread id, use gettid() */
          memset(&linfo, 0, sizeof(linfo)); /* linfo stays zeroed: no pid/tid parameter */
          opts.link_info = &linfo;
          opts.link_info_len = sizeof(linfo);
          ......
    }

``target_pid`` is a global variable in the BPF program. The user space program
should initialize the variable with a process ID to skip opened files of other
processes in the BPF program. When you parameterize a BPF iterator instead, the
iterator calls the BPF program fewer times, which can save significant
resources.
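
The cost difference between the two approaches can be illustrated with a toy
user-space model that counts how often the per-object callback runs. Nothing
here is kernel or libbpf code: ``demo_prog()`` stands in for the BPF program
that checks ``target_pid`` itself, and ``demo_iterate()`` stands in for the
iterator, with ``iter_pid == 0`` meaning "no parameters". All names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One open file of some process (hypothetical, simplified). */
struct demo_file_ent {
	uint32_t tgid;	/* owning process */
	uint32_t fd;
};

struct demo_stats {
	unsigned int calls;	/* how often the "BPF program" ran */
	unsigned int printed;	/* how many entries it reported */
};

/* Stand-in for the BPF program: it filters on target_pid by itself. */
static void demo_prog(const struct demo_file_ent *e, uint32_t target_pid,
		      struct demo_stats *st)
{
	st->calls++;
	if (e->tgid != target_pid)
		return;
	st->printed++;
}

/* Stand-in for the iterator. With iter_pid == 0 every entry reaches the
 * program; otherwise the walk skips other processes before calling it. */
static struct demo_stats demo_iterate(const struct demo_file_ent *ents,
				      size_t n, uint32_t iter_pid,
				      uint32_t target_pid)
{
	struct demo_stats st = { 0, 0 };
	size_t i;

	assert(ents != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (iter_pid && ents[i].tgid != iter_pid)
			continue;	/* the program is never called */
		demo_prog(&ents[i], target_pid, &st);
	}
	return st;
}
```

Both modes report the same entries for the target process, but only the
parameterized walk avoids invoking the program for objects it would discard
anyway.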

----------------------------
Parameterizing VMA Iterators
----------------------------

By default, a BPF VMA iterator includes every VMA in every process. However,
you can still specify a process or a thread to include only its VMAs. Unlike
files, a thread cannot have a separate address space (since Linux 2.6.0-test6).
Here, using *tid* makes no difference from using *pid*.

-----------------------------
Parameterizing Task Iterators
-----------------------------

A BPF task iterator with *pid* includes all tasks (threads) of a process. The
BPF program receives these tasks one after another. You can specify a BPF task
iterator with the *tid* parameter to include only the tasks that match the
given *tid*.