1==========================
2Extensible Scheduler Class
3==========================
4
5sched_ext is a scheduler class whose behavior can be defined by a set of BPF
6programs - the BPF scheduler.
7
8* sched_ext exports a full scheduling interface so that any scheduling
9  algorithm can be implemented on top.
10
11* The BPF scheduler can group CPUs however it sees fit and schedule them
12  together, as tasks aren't tied to specific CPUs at the time of wakeup.
13
14* The BPF scheduler can be turned on and off dynamically anytime.
15
16* The system integrity is maintained no matter what the BPF scheduler does.
17  The default scheduling behavior is restored anytime an error is detected,
18  a runnable task stalls, or on invoking the SysRq key sequence
19  `SysRq-S`.
20
21* When the BPF scheduler triggers an error, debug information is dumped to
22  aid debugging. The debug dump is passed to and printed out by the
23  scheduler binary. The debug dump can also be accessed through the
24  `sched_ext_dump` tracepoint. The SysRq key sequence `SysRq-D`
25  triggers a debug dump. This doesn't terminate the BPF scheduler and can
26  only be read through the tracepoint.
27
28Switching to and from sched_ext
29===============================
30
31``CONFIG_SCHED_CLASS_EXT`` is the config option to enable sched_ext and
32``tools/sched_ext`` contains the example schedulers. The following config
33options should be enabled to use sched_ext:
34
35.. code-block:: none
36
37    CONFIG_BPF=y
38    CONFIG_SCHED_CLASS_EXT=y
39    CONFIG_BPF_SYSCALL=y
40    CONFIG_BPF_JIT=y
41    CONFIG_DEBUG_INFO_BTF=y
42    CONFIG_BPF_JIT_ALWAYS_ON=y
43    CONFIG_BPF_JIT_DEFAULT_ON=y
44    CONFIG_PAHOLE_HAS_SPLIT_BTF=y
45    CONFIG_PAHOLE_HAS_BTF_TAG=y
46
47sched_ext is used only when the BPF scheduler is loaded and running.
48
49If a task explicitly sets its scheduling policy to ``SCHED_EXT``, it will be
50treated as ``SCHED_NORMAL`` and scheduled by CFS until the BPF scheduler is
51loaded.
52
53When the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is not set
54in ``ops->flags``, all ``SCHED_NORMAL``, ``SCHED_BATCH``, ``SCHED_IDLE``, and
55``SCHED_EXT`` tasks are scheduled by sched_ext.
56
57However, when the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is
58set in ``ops->flags``, only tasks with the ``SCHED_EXT`` policy are scheduled
59by sched_ext, while tasks with ``SCHED_NORMAL``, ``SCHED_BATCH`` and
60``SCHED_IDLE`` policies are scheduled by CFS.
61
62Terminating the sched_ext scheduler program, triggering `SysRq-S`, or
63detection of any internal error including stalled runnable tasks aborts the
64BPF scheduler and reverts all tasks back to CFS.
65
66.. code-block:: none
67
68    # make -j16 -C tools/sched_ext
69    # tools/sched_ext/build/bin/scx_simple
70    local=0 global=3
71    local=5 global=24
72    local=9 global=44
73    local=13 global=56
74    local=17 global=72
75    ^CEXIT: BPF scheduler unregistered
76
77The current status of the BPF scheduler can be determined as follows:
78
79.. code-block:: none
80
81    # cat /sys/kernel/sched_ext/state
82    enabled
83    # cat /sys/kernel/sched_ext/root/ops
84    simple
85
86You can check if any BPF scheduler has ever been loaded since boot by examining
87this monotonically incrementing counter (a value of zero indicates that no BPF
88scheduler has been loaded):
89
90.. code-block:: none
91
92    # cat /sys/kernel/sched_ext/enable_seq
93    1
94
95``tools/sched_ext/scx_show_state.py`` is a drgn script which shows more
96detailed information:
97
98.. code-block:: none
99
100    # tools/sched_ext/scx_show_state.py
101    ops           : simple
102    enabled       : 1
103    switching_all : 1
104    switched_all  : 1
105    enable_state  : enabled (2)
106    bypass_depth  : 0
107    nr_rejected   : 0
108    enable_seq    : 1
109
110Whether a given task is on sched_ext can be determined as follows:
111
112.. code-block:: none
113
114    # grep ext /proc/self/sched
115    ext.enabled                                  :                    1
116
117The Basics
118==========
119
120Userspace can implement an arbitrary BPF scheduler by loading a set of BPF
121programs that implement ``struct sched_ext_ops``. The only mandatory field
122is ``ops.name`` which must be a valid BPF object name. All operations are
123optional. The following modified excerpt is from
124``tools/sched_ext/scx_simple.bpf.c`` showing a minimal global FIFO scheduler.
125
126.. code-block:: c
127
128    /*
129     * Decide which CPU a task should be migrated to before being
130     * enqueued (either at wakeup, fork time, or exec time). If an
131     * idle core is found by the default ops.select_cpu() implementation,
132     * then insert the task directly into SCX_DSQ_LOCAL and skip the
133     * ops.enqueue() callback.
134     *
135     * Note that this implementation has exactly the same behavior as the
136     * default ops.select_cpu implementation. The behavior of the scheduler
137     * would be exactly same if the implementation just didn't define the
138     * simple_select_cpu() struct_ops prog.
139     */
140    s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p,
141                       s32 prev_cpu, u64 wake_flags)
142    {
143            s32 cpu;
144            /* Need to initialize or the BPF verifier will reject the program */
145            bool direct = false;
146
147            cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &direct);
148
149            if (direct)
150                    scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
151
152            return cpu;
153    }
154
155    /*
156     * Do a direct insertion of a task to the global DSQ. This ops.enqueue()
157     * callback will only be invoked if we failed to find a core to insert
158     * into in ops.select_cpu() above.
159     *
160     * Note that this implementation has exactly the same behavior as the
161     * default ops.enqueue implementation, which just dispatches the task
162     * to SCX_DSQ_GLOBAL. The behavior of the scheduler would be exactly same
163     * if the implementation just didn't define the simple_enqueue struct_ops
164     * prog.
165     */
166    void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
167    {
168            scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
169    }
170
171    s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
172    {
173            /*
174             * By default, all SCHED_EXT, SCHED_OTHER, SCHED_IDLE, and
175             * SCHED_BATCH tasks should use sched_ext.
176             */
177            return 0;
178    }
179
180    void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei)
181    {
182            exit_type = ei->type;
183    }
184
185    SEC(".struct_ops")
186    struct sched_ext_ops simple_ops = {
187            .select_cpu             = (void *)simple_select_cpu,
188            .enqueue                = (void *)simple_enqueue,
189            .init                   = (void *)simple_init,
190            .exit                   = (void *)simple_exit,
191            .name                   = "simple",
192    };
193
194Dispatch Queues
195---------------
196
197To match the impedance between the scheduler core and the BPF scheduler,
198sched_ext uses DSQs (dispatch queues) which can operate as both a FIFO and a
199priority queue. By default, there is one global FIFO (``SCX_DSQ_GLOBAL``),
200and one local dsq per CPU (``SCX_DSQ_LOCAL``). The BPF scheduler can manage
201an arbitrary number of dsq's using ``scx_bpf_create_dsq()`` and
202``scx_bpf_destroy_dsq()``.
203
204A CPU always executes a task from its local DSQ. A task is "inserted" into a
205DSQ. A task in a non-local DSQ is "move"d into the target CPU's local DSQ.
206
207When a CPU is looking for the next task to run, if the local DSQ is not
208empty, the first task is picked. Otherwise, the CPU tries to move a task
209from the global DSQ. If that doesn't yield a runnable task either,
210``ops.dispatch()`` is invoked.
211
212Scheduling Cycle
213----------------
214
215The following briefly shows how a waking task is scheduled and executed.
216
2171. When a task is waking up, ``ops.select_cpu()`` is the first operation
218   invoked. This serves two purposes. First, CPU selection optimization
219   hint. Second, waking up the selected CPU if idle.
220
221   The CPU selected by ``ops.select_cpu()`` is an optimization hint and not
222   binding. The actual decision is made at the last step of scheduling.
223   However, there is a small performance gain if the CPU
224   ``ops.select_cpu()`` returns matches the CPU the task eventually runs on.
225
226   A side-effect of selecting a CPU is waking it up from idle. While a BPF
227   scheduler can wake up any cpu using the ``scx_bpf_kick_cpu()`` helper,
228   using ``ops.select_cpu()`` judiciously can be simpler and more efficient.
229
230   A task can be immediately inserted into a DSQ from ``ops.select_cpu()``
231   by calling ``scx_bpf_dsq_insert()``. If the task is inserted into
232   ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be inserted into the
233   local DSQ of whichever CPU is returned from ``ops.select_cpu()``.
234   Additionally, inserting directly from ``ops.select_cpu()`` will cause the
235   ``ops.enqueue()`` callback to be skipped.
236
237   Note that the scheduler core will ignore an invalid CPU selection, for
238   example, if it's outside the allowed cpumask of the task.
239
2402. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the
241   task was inserted directly from ``ops.select_cpu()``). ``ops.enqueue()``
242   can make one of the following decisions:
243
244   * Immediately insert the task into either the global or a local DSQ by
245     calling ``scx_bpf_dsq_insert()`` with one of the following options:
246     ``SCX_DSQ_GLOBAL``, ``SCX_DSQ_LOCAL``, or ``SCX_DSQ_LOCAL_ON | cpu``.
247
248   * Immediately insert the task into a custom DSQ by calling
249     ``scx_bpf_dsq_insert()`` with a DSQ ID which is smaller than 2^63.
250
251   * Queue the task on the BPF side.
252
2533. When a CPU is ready to schedule, it first looks at its local DSQ. If
254   empty, it then looks at the global DSQ. If there still isn't a task to
255   run, ``ops.dispatch()`` is invoked which can use the following two
256   functions to populate the local DSQ.
257
258   * ``scx_bpf_dsq_insert()`` inserts a task to a DSQ. Any target DSQ can be
259     used - ``SCX_DSQ_LOCAL``, ``SCX_DSQ_LOCAL_ON | cpu``,
260     ``SCX_DSQ_GLOBAL`` or a custom DSQ. While ``scx_bpf_dsq_insert()``
261     currently can't be called with BPF locks held, this is being worked on
262     and will be supported. ``scx_bpf_dsq_insert()`` schedules insertion
263     rather than performing them immediately. There can be up to
264     ``ops.dispatch_max_batch`` pending tasks.
265
266   * ``scx_bpf_move_to_local()`` moves a task from the specified non-local
267     DSQ to the dispatching DSQ. This function cannot be called with any BPF
268     locks held. ``scx_bpf_move_to_local()`` flushes the pending insertions
269     tasks before trying to move from the specified DSQ.
270
2714. After ``ops.dispatch()`` returns, if there are tasks in the local DSQ,
272   the CPU runs the first one. If empty, the following steps are taken:
273
274   * Try to move from the global DSQ. If successful, run the task.
275
276   * If ``ops.dispatch()`` has dispatched any tasks, retry #3.
277
278   * If the previous task is an SCX task and still runnable, keep executing
279     it (see ``SCX_OPS_ENQ_LAST``).
280
281   * Go idle.
282
283Note that the BPF scheduler can always choose to dispatch tasks immediately
284in ``ops.enqueue()`` as illustrated in the above simple example. If only the
285built-in DSQs are used, there is no need to implement ``ops.dispatch()`` as
286a task is never queued on the BPF scheduler and both the local and global
287DSQs are executed automatically.
288
289``scx_bpf_dsq_insert()`` inserts the task on the FIFO of the target DSQ. Use
290``scx_bpf_dsq_insert_vtime()`` for the priority queue. Internal DSQs such as
291``SCX_DSQ_LOCAL`` and ``SCX_DSQ_GLOBAL`` do not support priority-queue
292dispatching, and must be dispatched to with ``scx_bpf_dsq_insert()``. See
293the function documentation and usage in ``tools/sched_ext/scx_simple.bpf.c``
294for more information.
295
296Task Lifecycle
297--------------
298
299The following pseudo-code summarizes the entire lifecycle of a task managed
300by a sched_ext scheduler:
301
302.. code-block:: c
303
304    ops.init_task();            /* A new task is created */
305    ops.enable();               /* Enable BPF scheduling for the task */
306
307    while (task in SCHED_EXT) {
308        if (task can migrate)
309            ops.select_cpu();   /* Called on wakeup (optimization) */
310
311        ops.runnable();         /* Task becomes ready to run */
312
313        while (task is runnable) {
314            if (task is not in a DSQ) {
315                ops.enqueue();  /* Task can be added to a DSQ */
316
317                /* A CPU becomes available */
318
319                ops.dispatch(); /* Task is moved to a local DSQ */
320            }
321            ops.running();      /* Task starts running on its assigned CPU */
322            ops.tick();         /* Called every 1/HZ seconds */
323            ops.stopping();     /* Task stops running (time slice expires or wait) */
324        }
325
326        ops.quiescent();        /* Task releases its assigned CPU (wait) */
327    }
328
329    ops.disable();              /* Disable BPF scheduling for the task */
330    ops.exit_task();            /* Task is destroyed */
331
332Where to Look
333=============
334
335* ``include/linux/sched/ext.h`` defines the core data structures, ops table
336  and constants.
337
338* ``kernel/sched/ext.c`` contains sched_ext core implementation and helpers.
339  The functions prefixed with ``scx_bpf_`` can be called from the BPF
340  scheduler.
341
342* ``tools/sched_ext/`` hosts example BPF scheduler implementations.
343
344  * ``scx_simple[.bpf].c``: Minimal global FIFO scheduler example using a
345    custom DSQ.
346
347  * ``scx_qmap[.bpf].c``: A multi-level FIFO scheduler supporting five
348    levels of priority implemented with ``BPF_MAP_TYPE_QUEUE``.
349
350ABI Instability
351===============
352
353The APIs provided by sched_ext to BPF schedulers programs have no stability
354guarantees. This includes the ops table callbacks and constants defined in
355``include/linux/sched/ext.h``, as well as the ``scx_bpf_`` kfuncs defined in
356``kernel/sched/ext.c``.
357
358While we will attempt to provide a relatively stable API surface when
359possible, they are subject to change without warning between kernel
360versions.
361