1.. _sched-ext: 2 3========================== 4Extensible Scheduler Class 5========================== 6 7sched_ext is a scheduler class whose behavior can be defined by a set of BPF 8programs - the BPF scheduler. 9 10* sched_ext exports a full scheduling interface so that any scheduling 11 algorithm can be implemented on top. 12 13* The BPF scheduler can group CPUs however it sees fit and schedule them 14 together, as tasks aren't tied to specific CPUs at the time of wakeup. 15 16* The BPF scheduler can be turned on and off dynamically anytime. 17 18* The system integrity is maintained no matter what the BPF scheduler does. 19 The default scheduling behavior is restored anytime an error is detected, 20 a runnable task stalls, or on invoking the SysRq key sequence 21 `SysRq-S`. 22 23* When the BPF scheduler triggers an error, debug information is dumped to 24 aid debugging. The debug dump is passed to and printed out by the 25 scheduler binary. The debug dump can also be accessed through the 26 `sched_ext_dump` tracepoint. The SysRq key sequence `SysRq-D` 27 triggers a debug dump. This doesn't terminate the BPF scheduler and can 28 only be read through the tracepoint. 29 30Switching to and from sched_ext 31=============================== 32 33``CONFIG_SCHED_CLASS_EXT`` is the config option to enable sched_ext and 34``tools/sched_ext`` contains the example schedulers. The following config 35options should be enabled to use sched_ext: 36 37.. code-block:: none 38 39 CONFIG_BPF=y 40 CONFIG_SCHED_CLASS_EXT=y 41 CONFIG_BPF_SYSCALL=y 42 CONFIG_BPF_JIT=y 43 CONFIG_DEBUG_INFO_BTF=y 44 CONFIG_BPF_JIT_ALWAYS_ON=y 45 CONFIG_BPF_JIT_DEFAULT_ON=y 46 CONFIG_PAHOLE_HAS_SPLIT_BTF=y 47 CONFIG_PAHOLE_HAS_BTF_TAG=y 48 49sched_ext is used only when the BPF scheduler is loaded and running. 50 51If a task explicitly sets its scheduling policy to ``SCHED_EXT``, it will be 52treated as ``SCHED_NORMAL`` and scheduled by the fair-class scheduler until the 53BPF scheduler is loaded. 54 55When the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is not set 56in ``ops->flags``, all ``SCHED_NORMAL``, ``SCHED_BATCH``, ``SCHED_IDLE``, and 57``SCHED_EXT`` tasks are scheduled by sched_ext. 58 59However, when the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is 60set in ``ops->flags``, only tasks with the ``SCHED_EXT`` policy are scheduled 61by sched_ext, while tasks with ``SCHED_NORMAL``, ``SCHED_BATCH`` and 62``SCHED_IDLE`` policies are scheduled by the fair-class scheduler. 63 64Terminating the sched_ext scheduler program, triggering `SysRq-S`, or 65detection of any internal error including stalled runnable tasks aborts the 66BPF scheduler and reverts all tasks back to the fair-class scheduler. 67 68.. code-block:: none 69 70 # make -j16 -C tools/sched_ext 71 # tools/sched_ext/build/bin/scx_simple 72 local=0 global=3 73 local=5 global=24 74 local=9 global=44 75 local=13 global=56 76 local=17 global=72 77 ^CEXIT: BPF scheduler unregistered 78 79The current status of the BPF scheduler can be determined as follows: 80 81.. code-block:: none 82 83 # cat /sys/kernel/sched_ext/state 84 enabled 85 # cat /sys/kernel/sched_ext/root/ops 86 simple 87 88You can check if any BPF scheduler has ever been loaded since boot by examining 89this monotonically incrementing counter (a value of zero indicates that no BPF 90scheduler has been loaded): 91 92.. code-block:: none 93 94 # cat /sys/kernel/sched_ext/enable_seq 95 1 96 97``tools/sched_ext/scx_show_state.py`` is a drgn script which shows more 98detailed information: 99 100.. code-block:: none 101 102 # tools/sched_ext/scx_show_state.py 103 ops : simple 104 enabled : 1 105 switching_all : 1 106 switched_all : 1 107 enable_state : enabled (2) 108 bypass_depth : 0 109 nr_rejected : 0 110 enable_seq : 1 111 112Whether a given task is on sched_ext can be determined as follows: 113 114.. code-block:: none 115 116 # grep ext /proc/self/sched 117 ext.enabled : 1 118 119The Basics 120========== 121 122Userspace can implement an arbitrary BPF scheduler by loading a set of BPF 123programs that implement ``struct sched_ext_ops``. The only mandatory field 124is ``ops.name`` which must be a valid BPF object name. All operations are 125optional. The following modified excerpt is from 126``tools/sched_ext/scx_simple.bpf.c`` showing a minimal global FIFO scheduler. 127 128.. code-block:: c 129 130 /* 131 * Decide which CPU a task should be migrated to before being 132 * enqueued (either at wakeup, fork time, or exec time). If an 133 * idle core is found by the default ops.select_cpu() implementation, 134 * then insert the task directly into SCX_DSQ_LOCAL and skip the 135 * ops.enqueue() callback. 136 * 137 * Note that this implementation has exactly the same behavior as the 138 * default ops.select_cpu implementation. The behavior of the scheduler 139 * would be exactly same if the implementation just didn't define the 140 * simple_select_cpu() struct_ops prog. 141 */ 142 s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p, 143 s32 prev_cpu, u64 wake_flags) 144 { 145 s32 cpu; 146 /* Need to initialize or the BPF verifier will reject the program */ 147 bool direct = false; 148 149 cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &direct); 150 151 if (direct) 152 scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0); 153 154 return cpu; 155 } 156 157 /* 158 * Do a direct insertion of a task to the global DSQ. This ops.enqueue() 159 * callback will only be invoked if we failed to find a core to insert 160 * into in ops.select_cpu() above. 161 * 162 * Note that this implementation has exactly the same behavior as the 163 * default ops.enqueue implementation, which just dispatches the task 164 * to SCX_DSQ_GLOBAL. The behavior of the scheduler would be exactly same 165 * if the implementation just didn't define the simple_enqueue struct_ops 166 * prog. 167 */ 168 void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags) 169 { 170 scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags); 171 } 172 173 s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init) 174 { 175 /* 176 * By default, all SCHED_EXT, SCHED_OTHER, SCHED_IDLE, and 177 * SCHED_BATCH tasks should use sched_ext. 178 */ 179 return 0; 180 } 181 182 void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei) 183 { 184 exit_type = ei->type; 185 } 186 187 SEC(".struct_ops") 188 struct sched_ext_ops simple_ops = { 189 .select_cpu = (void *)simple_select_cpu, 190 .enqueue = (void *)simple_enqueue, 191 .init = (void *)simple_init, 192 .exit = (void *)simple_exit, 193 .name = "simple", 194 }; 195 196Dispatch Queues 197--------------- 198 199To match the impedance between the scheduler core and the BPF scheduler, 200sched_ext uses DSQs (dispatch queues) which can operate as both a FIFO and a 201priority queue. By default, there is one global FIFO (``SCX_DSQ_GLOBAL``), 202and one local DSQ per CPU (``SCX_DSQ_LOCAL``). The BPF scheduler can manage 203an arbitrary number of DSQs using ``scx_bpf_create_dsq()`` and 204``scx_bpf_destroy_dsq()``. 205 206A CPU always executes a task from its local DSQ. A task is "inserted" into a 207DSQ. A task in a non-local DSQ is "move"d into the target CPU's local DSQ. 208 209When a CPU is looking for the next task to run, if the local DSQ is not 210empty, the first task is picked. Otherwise, the CPU tries to move a task 211from the global DSQ. If that doesn't yield a runnable task either, 212``ops.dispatch()`` is invoked. 213 214Scheduling Cycle 215---------------- 216 217The following briefly shows how a waking task is scheduled and executed. 218 2191. When a task is waking up, ``ops.select_cpu()`` is the first operation 220 invoked. This serves two purposes. First, CPU selection optimization 221 hint. Second, waking up the selected CPU if idle. 222 223 The CPU selected by ``ops.select_cpu()`` is an optimization hint and not 224 binding. The actual decision is made at the last step of scheduling. 225 However, there is a small performance gain if the CPU 226 ``ops.select_cpu()`` returns matches the CPU the task eventually runs on. 227 228 A side-effect of selecting a CPU is waking it up from idle. While a BPF 229 scheduler can wake up any cpu using the ``scx_bpf_kick_cpu()`` helper, 230 using ``ops.select_cpu()`` judiciously can be simpler and more efficient. 231 232 A task can be immediately inserted into a DSQ from ``ops.select_cpu()`` 233 by calling ``scx_bpf_dsq_insert()``. If the task is inserted into 234 ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be inserted into the 235 local DSQ of whichever CPU is returned from ``ops.select_cpu()``. 236 Additionally, inserting directly from ``ops.select_cpu()`` will cause the 237 ``ops.enqueue()`` callback to be skipped. 238 239 Note that the scheduler core will ignore an invalid CPU selection, for 240 example, if it's outside the allowed cpumask of the task. 241 2422. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the 243 task was inserted directly from ``ops.select_cpu()``). ``ops.enqueue()`` 244 can make one of the following decisions: 245 246 * Immediately insert the task into either the global or a local DSQ by 247 calling ``scx_bpf_dsq_insert()`` with one of the following options: 248 ``SCX_DSQ_GLOBAL``, ``SCX_DSQ_LOCAL``, or ``SCX_DSQ_LOCAL_ON | cpu``. 249 250 * Immediately insert the task into a custom DSQ by calling 251 ``scx_bpf_dsq_insert()`` with a DSQ ID which is smaller than 2^63. 252 253 * Queue the task on the BPF side. 254 2553. When a CPU is ready to schedule, it first looks at its local DSQ. If 256 empty, it then looks at the global DSQ. If there still isn't a task to 257 run, ``ops.dispatch()`` is invoked which can use the following two 258 functions to populate the local DSQ. 259 260 * ``scx_bpf_dsq_insert()`` inserts a task to a DSQ. Any target DSQ can be 261 used - ``SCX_DSQ_LOCAL``, ``SCX_DSQ_LOCAL_ON | cpu``, 262 ``SCX_DSQ_GLOBAL`` or a custom DSQ. While ``scx_bpf_dsq_insert()`` 263 currently can't be called with BPF locks held, this is being worked on 264 and will be supported. ``scx_bpf_dsq_insert()`` schedules insertion 265 rather than performing them immediately. There can be up to 266 ``ops.dispatch_max_batch`` pending tasks. 267 268 * ``scx_bpf_move_to_local()`` moves a task from the specified non-local 269 DSQ to the dispatching DSQ. This function cannot be called with any BPF 270 locks held. ``scx_bpf_move_to_local()`` flushes the pending insertions 271 tasks before trying to move from the specified DSQ. 272 2734. After ``ops.dispatch()`` returns, if there are tasks in the local DSQ, 274 the CPU runs the first one. If empty, the following steps are taken: 275 276 * Try to move from the global DSQ. If successful, run the task. 277 278 * If ``ops.dispatch()`` has dispatched any tasks, retry #3. 279 280 * If the previous task is an SCX task and still runnable, keep executing 281 it (see ``SCX_OPS_ENQ_LAST``). 282 283 * Go idle. 284 285Note that the BPF scheduler can always choose to dispatch tasks immediately 286in ``ops.enqueue()`` as illustrated in the above simple example. If only the 287built-in DSQs are used, there is no need to implement ``ops.dispatch()`` as 288a task is never queued on the BPF scheduler and both the local and global 289DSQs are executed automatically. 290 291``scx_bpf_dsq_insert()`` inserts the task on the FIFO of the target DSQ. Use 292``scx_bpf_dsq_insert_vtime()`` for the priority queue. Internal DSQs such as 293``SCX_DSQ_LOCAL`` and ``SCX_DSQ_GLOBAL`` do not support priority-queue 294dispatching, and must be dispatched to with ``scx_bpf_dsq_insert()``. See 295the function documentation and usage in ``tools/sched_ext/scx_simple.bpf.c`` 296for more information. 297 298Task Lifecycle 299-------------- 300 301The following pseudo-code summarizes the entire lifecycle of a task managed 302by a sched_ext scheduler: 303 304.. code-block:: c 305 306 ops.init_task(); /* A new task is created */ 307 ops.enable(); /* Enable BPF scheduling for the task */ 308 309 while (task in SCHED_EXT) { 310 if (task can migrate) 311 ops.select_cpu(); /* Called on wakeup (optimization) */ 312 313 ops.runnable(); /* Task becomes ready to run */ 314 315 while (task is runnable) { 316 if (task is not in a DSQ) { 317 ops.enqueue(); /* Task can be added to a DSQ */ 318 319 /* A CPU becomes available */ 320 321 ops.dispatch(); /* Task is moved to a local DSQ */ 322 } 323 ops.running(); /* Task starts running on its assigned CPU */ 324 ops.tick(); /* Called every 1/HZ seconds */ 325 ops.stopping(); /* Task stops running (time slice expires or wait) */ 326 } 327 328 ops.quiescent(); /* Task releases its assigned CPU (wait) */ 329 } 330 331 ops.disable(); /* Disable BPF scheduling for the task */ 332 ops.exit_task(); /* Task is destroyed */ 333 334Where to Look 335============= 336 337* ``include/linux/sched/ext.h`` defines the core data structures, ops table 338 and constants. 339 340* ``kernel/sched/ext.c`` contains sched_ext core implementation and helpers. 341 The functions prefixed with ``scx_bpf_`` can be called from the BPF 342 scheduler. 343 344* ``tools/sched_ext/`` hosts example BPF scheduler implementations. 345 346 * ``scx_simple[.bpf].c``: Minimal global FIFO scheduler example using a 347 custom DSQ. 348 349 * ``scx_qmap[.bpf].c``: A multi-level FIFO scheduler supporting five 350 levels of priority implemented with ``BPF_MAP_TYPE_QUEUE``. 351 352ABI Instability 353=============== 354 355The APIs provided by sched_ext to BPF schedulers programs have no stability 356guarantees. This includes the ops table callbacks and constants defined in 357``include/linux/sched/ext.h``, as well as the ``scx_bpf_`` kfuncs defined in 358``kernel/sched/ext.c``. 359 360While we will attempt to provide a relatively stable API surface when 361possible, they are subject to change without warning between kernel 362versions. 363