1.. _sched-ext:
2
3==========================
4Extensible Scheduler Class
5==========================
6
7sched_ext is a scheduler class whose behavior can be defined by a set of BPF
8programs - the BPF scheduler.
9
10* sched_ext exports a full scheduling interface so that any scheduling
11  algorithm can be implemented on top.
12
13* The BPF scheduler can group CPUs however it sees fit and schedule them
14  together, as tasks aren't tied to specific CPUs at the time of wakeup.
15
16* The BPF scheduler can be turned on and off dynamically anytime.
17
18* The system integrity is maintained no matter what the BPF scheduler does.
19  The default scheduling behavior is restored anytime an error is detected,
20  a runnable task stalls, or on invoking the SysRq key sequence
21  `SysRq-S`.
22
23* When the BPF scheduler triggers an error, debug information is dumped to
24  aid debugging. The debug dump is passed to and printed out by the
25  scheduler binary. The debug dump can also be accessed through the
26  `sched_ext_dump` tracepoint. The SysRq key sequence `SysRq-D`
27  triggers a debug dump. This doesn't terminate the BPF scheduler and can
28  only be read through the tracepoint.
29
30Switching to and from sched_ext
31===============================
32
33``CONFIG_SCHED_CLASS_EXT`` is the config option to enable sched_ext and
34``tools/sched_ext`` contains the example schedulers. The following config
35options should be enabled to use sched_ext:
36
37.. code-block:: none
38
39    CONFIG_BPF=y
40    CONFIG_SCHED_CLASS_EXT=y
41    CONFIG_BPF_SYSCALL=y
42    CONFIG_BPF_JIT=y
43    CONFIG_DEBUG_INFO_BTF=y
44    CONFIG_BPF_JIT_ALWAYS_ON=y
45    CONFIG_BPF_JIT_DEFAULT_ON=y
46    CONFIG_PAHOLE_HAS_SPLIT_BTF=y
47    CONFIG_PAHOLE_HAS_BTF_TAG=y
48
49sched_ext is used only when the BPF scheduler is loaded and running.
50
51If a task explicitly sets its scheduling policy to ``SCHED_EXT``, it will be
52treated as ``SCHED_NORMAL`` and scheduled by the fair-class scheduler until the
53BPF scheduler is loaded.
54
55When the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is not set
56in ``ops->flags``, all ``SCHED_NORMAL``, ``SCHED_BATCH``, ``SCHED_IDLE``, and
57``SCHED_EXT`` tasks are scheduled by sched_ext.
58
59However, when the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is
60set in ``ops->flags``, only tasks with the ``SCHED_EXT`` policy are scheduled
61by sched_ext, while tasks with ``SCHED_NORMAL``, ``SCHED_BATCH`` and
62``SCHED_IDLE`` policies are scheduled by the fair-class scheduler.
63
64Terminating the sched_ext scheduler program, triggering `SysRq-S`, or
65detection of any internal error including stalled runnable tasks aborts the
66BPF scheduler and reverts all tasks back to the fair-class scheduler.
67
68.. code-block:: none
69
70    # make -j16 -C tools/sched_ext
71    # tools/sched_ext/build/bin/scx_simple
72    local=0 global=3
73    local=5 global=24
74    local=9 global=44
75    local=13 global=56
76    local=17 global=72
77    ^CEXIT: BPF scheduler unregistered
78
79The current status of the BPF scheduler can be determined as follows:
80
81.. code-block:: none
82
83    # cat /sys/kernel/sched_ext/state
84    enabled
85    # cat /sys/kernel/sched_ext/root/ops
86    simple
87
88You can check if any BPF scheduler has ever been loaded since boot by examining
89this monotonically incrementing counter (a value of zero indicates that no BPF
90scheduler has been loaded):
91
92.. code-block:: none
93
94    # cat /sys/kernel/sched_ext/enable_seq
95    1
96
97``tools/sched_ext/scx_show_state.py`` is a drgn script which shows more
98detailed information:
99
100.. code-block:: none
101
102    # tools/sched_ext/scx_show_state.py
103    ops           : simple
104    enabled       : 1
105    switching_all : 1
106    switched_all  : 1
107    enable_state  : enabled (2)
108    bypass_depth  : 0
109    nr_rejected   : 0
110    enable_seq    : 1
111
112Whether a given task is on sched_ext can be determined as follows:
113
114.. code-block:: none
115
116    # grep ext /proc/self/sched
117    ext.enabled                                  :                    1
118
119The Basics
120==========
121
122Userspace can implement an arbitrary BPF scheduler by loading a set of BPF
123programs that implement ``struct sched_ext_ops``. The only mandatory field
124is ``ops.name`` which must be a valid BPF object name. All operations are
125optional. The following modified excerpt is from
126``tools/sched_ext/scx_simple.bpf.c`` showing a minimal global FIFO scheduler.
127
128.. code-block:: c
129
130    /*
131     * Decide which CPU a task should be migrated to before being
132     * enqueued (either at wakeup, fork time, or exec time). If an
133     * idle core is found by the default ops.select_cpu() implementation,
134     * then insert the task directly into SCX_DSQ_LOCAL and skip the
135     * ops.enqueue() callback.
136     *
137     * Note that this implementation has exactly the same behavior as the
138     * default ops.select_cpu implementation. The behavior of the scheduler
139     * would be exactly same if the implementation just didn't define the
140     * simple_select_cpu() struct_ops prog.
141     */
142    s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p,
143                       s32 prev_cpu, u64 wake_flags)
144    {
145            s32 cpu;
146            /* Need to initialize or the BPF verifier will reject the program */
147            bool direct = false;
148
149            cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &direct);
150
151            if (direct)
152                    scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
153
154            return cpu;
155    }
156
157    /*
158     * Do a direct insertion of a task to the global DSQ. This ops.enqueue()
159     * callback will only be invoked if we failed to find a core to insert
160     * into in ops.select_cpu() above.
161     *
162     * Note that this implementation has exactly the same behavior as the
163     * default ops.enqueue implementation, which just dispatches the task
164     * to SCX_DSQ_GLOBAL. The behavior of the scheduler would be exactly same
165     * if the implementation just didn't define the simple_enqueue struct_ops
166     * prog.
167     */
168    void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
169    {
170            scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
171    }
172
173    s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
174    {
175            /*
176             * By default, all SCHED_EXT, SCHED_OTHER, SCHED_IDLE, and
177             * SCHED_BATCH tasks should use sched_ext.
178             */
179            return 0;
180    }
181
182    void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei)
183    {
184            exit_type = ei->type;
185    }
186
187    SEC(".struct_ops")
188    struct sched_ext_ops simple_ops = {
189            .select_cpu             = (void *)simple_select_cpu,
190            .enqueue                = (void *)simple_enqueue,
191            .init                   = (void *)simple_init,
192            .exit                   = (void *)simple_exit,
193            .name                   = "simple",
194    };
195
196Dispatch Queues
197---------------
198
199To match the impedance between the scheduler core and the BPF scheduler,
200sched_ext uses DSQs (dispatch queues) which can operate as both a FIFO and a
201priority queue. By default, there is one global FIFO (``SCX_DSQ_GLOBAL``),
202and one local DSQ per CPU (``SCX_DSQ_LOCAL``). The BPF scheduler can manage
203an arbitrary number of DSQs using ``scx_bpf_create_dsq()`` and
204``scx_bpf_destroy_dsq()``.
205
206A CPU always executes a task from its local DSQ. A task is "inserted" into a
207DSQ. A task in a non-local DSQ is "move"d into the target CPU's local DSQ.
208
209When a CPU is looking for the next task to run, if the local DSQ is not
210empty, the first task is picked. Otherwise, the CPU tries to move a task
211from the global DSQ. If that doesn't yield a runnable task either,
212``ops.dispatch()`` is invoked.
213
214Scheduling Cycle
215----------------
216
217The following briefly shows how a waking task is scheduled and executed.
218
2191. When a task is waking up, ``ops.select_cpu()`` is the first operation
220   invoked. This serves two purposes. First, CPU selection optimization
221   hint. Second, waking up the selected CPU if idle.
222
223   The CPU selected by ``ops.select_cpu()`` is an optimization hint and not
224   binding. The actual decision is made at the last step of scheduling.
225   However, there is a small performance gain if the CPU
226   ``ops.select_cpu()`` returns matches the CPU the task eventually runs on.
227
228   A side-effect of selecting a CPU is waking it up from idle. While a BPF
229   scheduler can wake up any cpu using the ``scx_bpf_kick_cpu()`` helper,
230   using ``ops.select_cpu()`` judiciously can be simpler and more efficient.
231
232   A task can be immediately inserted into a DSQ from ``ops.select_cpu()``
233   by calling ``scx_bpf_dsq_insert()``. If the task is inserted into
234   ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be inserted into the
235   local DSQ of whichever CPU is returned from ``ops.select_cpu()``.
236   Additionally, inserting directly from ``ops.select_cpu()`` will cause the
237   ``ops.enqueue()`` callback to be skipped.
238
239   Note that the scheduler core will ignore an invalid CPU selection, for
240   example, if it's outside the allowed cpumask of the task.
241
2422. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the
243   task was inserted directly from ``ops.select_cpu()``). ``ops.enqueue()``
244   can make one of the following decisions:
245
246   * Immediately insert the task into either the global or a local DSQ by
247     calling ``scx_bpf_dsq_insert()`` with one of the following options:
248     ``SCX_DSQ_GLOBAL``, ``SCX_DSQ_LOCAL``, or ``SCX_DSQ_LOCAL_ON | cpu``.
249
250   * Immediately insert the task into a custom DSQ by calling
251     ``scx_bpf_dsq_insert()`` with a DSQ ID which is smaller than 2^63.
252
253   * Queue the task on the BPF side.
254
2553. When a CPU is ready to schedule, it first looks at its local DSQ. If
256   empty, it then looks at the global DSQ. If there still isn't a task to
257   run, ``ops.dispatch()`` is invoked which can use the following two
258   functions to populate the local DSQ.
259
260   * ``scx_bpf_dsq_insert()`` inserts a task to a DSQ. Any target DSQ can be
261     used - ``SCX_DSQ_LOCAL``, ``SCX_DSQ_LOCAL_ON | cpu``,
262     ``SCX_DSQ_GLOBAL`` or a custom DSQ. While ``scx_bpf_dsq_insert()``
263     currently can't be called with BPF locks held, this is being worked on
264     and will be supported. ``scx_bpf_dsq_insert()`` schedules insertion
265     rather than performing them immediately. There can be up to
266     ``ops.dispatch_max_batch`` pending tasks.
267
268   * ``scx_bpf_move_to_local()`` moves a task from the specified non-local
269     DSQ to the dispatching DSQ. This function cannot be called with any BPF
270     locks held. ``scx_bpf_move_to_local()`` flushes the pending insertions
271     tasks before trying to move from the specified DSQ.
272
2734. After ``ops.dispatch()`` returns, if there are tasks in the local DSQ,
274   the CPU runs the first one. If empty, the following steps are taken:
275
276   * Try to move from the global DSQ. If successful, run the task.
277
278   * If ``ops.dispatch()`` has dispatched any tasks, retry #3.
279
280   * If the previous task is an SCX task and still runnable, keep executing
281     it (see ``SCX_OPS_ENQ_LAST``).
282
283   * Go idle.
284
285Note that the BPF scheduler can always choose to dispatch tasks immediately
286in ``ops.enqueue()`` as illustrated in the above simple example. If only the
287built-in DSQs are used, there is no need to implement ``ops.dispatch()`` as
288a task is never queued on the BPF scheduler and both the local and global
289DSQs are executed automatically.
290
291``scx_bpf_dsq_insert()`` inserts the task on the FIFO of the target DSQ. Use
292``scx_bpf_dsq_insert_vtime()`` for the priority queue. Internal DSQs such as
293``SCX_DSQ_LOCAL`` and ``SCX_DSQ_GLOBAL`` do not support priority-queue
294dispatching, and must be dispatched to with ``scx_bpf_dsq_insert()``. See
295the function documentation and usage in ``tools/sched_ext/scx_simple.bpf.c``
296for more information.
297
298Task Lifecycle
299--------------
300
301The following pseudo-code summarizes the entire lifecycle of a task managed
302by a sched_ext scheduler:
303
304.. code-block:: c
305
306    ops.init_task();            /* A new task is created */
307    ops.enable();               /* Enable BPF scheduling for the task */
308
309    while (task in SCHED_EXT) {
310        if (task can migrate)
311            ops.select_cpu();   /* Called on wakeup (optimization) */
312
313        ops.runnable();         /* Task becomes ready to run */
314
315        while (task is runnable) {
316            if (task is not in a DSQ) {
317                ops.enqueue();  /* Task can be added to a DSQ */
318
319                /* A CPU becomes available */
320
321                ops.dispatch(); /* Task is moved to a local DSQ */
322            }
323            ops.running();      /* Task starts running on its assigned CPU */
324            ops.tick();         /* Called every 1/HZ seconds */
325            ops.stopping();     /* Task stops running (time slice expires or wait) */
326        }
327
328        ops.quiescent();        /* Task releases its assigned CPU (wait) */
329    }
330
331    ops.disable();              /* Disable BPF scheduling for the task */
332    ops.exit_task();            /* Task is destroyed */
333
334Where to Look
335=============
336
337* ``include/linux/sched/ext.h`` defines the core data structures, ops table
338  and constants.
339
340* ``kernel/sched/ext.c`` contains sched_ext core implementation and helpers.
341  The functions prefixed with ``scx_bpf_`` can be called from the BPF
342  scheduler.
343
344* ``tools/sched_ext/`` hosts example BPF scheduler implementations.
345
346  * ``scx_simple[.bpf].c``: Minimal global FIFO scheduler example using a
347    custom DSQ.
348
349  * ``scx_qmap[.bpf].c``: A multi-level FIFO scheduler supporting five
350    levels of priority implemented with ``BPF_MAP_TYPE_QUEUE``.
351
352ABI Instability
353===============
354
355The APIs provided by sched_ext to BPF schedulers programs have no stability
356guarantees. This includes the ops table callbacks and constants defined in
357``include/linux/sched/ext.h``, as well as the ``scx_bpf_`` kfuncs defined in
358``kernel/sched/ext.c``.
359
360While we will attempt to provide a relatively stable API surface when
361possible, they are subject to change without warning between kernel
362versions.
363