==========================
Extensible Scheduler Class
==========================

sched_ext is a scheduler class whose behavior can be defined by a set of BPF
programs - the BPF scheduler.

* sched_ext exports a full scheduling interface so that any scheduling
  algorithm can be implemented on top.

* The BPF scheduler can group CPUs however it sees fit and schedule them
  together, as tasks aren't tied to specific CPUs at the time of wakeup.

* The BPF scheduler can be turned on and off dynamically anytime.

* The system integrity is maintained no matter what the BPF scheduler does.
  The default scheduling behavior is restored anytime an error is detected,
  a runnable task stalls, or on invoking the SysRq key sequence
  ``SysRq-S``.

* When the BPF scheduler triggers an error, debug information is dumped to
  aid debugging. The debug dump is passed to and printed out by the
  scheduler binary. The debug dump can also be accessed through the
  ``sched_ext_dump`` tracepoint. The SysRq key sequence ``SysRq-D`` triggers
  a debug dump. This doesn't terminate the BPF scheduler and can only be
  read through the tracepoint.

Switching to and from sched_ext
===============================

``CONFIG_SCHED_CLASS_EXT`` is the config option to enable sched_ext and
``tools/sched_ext`` contains the example schedulers. The following config
options should be enabled to use sched_ext:

.. code-block:: none

    CONFIG_BPF=y
    CONFIG_SCHED_CLASS_EXT=y
    CONFIG_BPF_SYSCALL=y
    CONFIG_BPF_JIT=y
    CONFIG_DEBUG_INFO_BTF=y
    CONFIG_BPF_JIT_ALWAYS_ON=y
    CONFIG_BPF_JIT_DEFAULT_ON=y
    CONFIG_PAHOLE_HAS_SPLIT_BTF=y
    CONFIG_PAHOLE_HAS_BTF_TAG=y

sched_ext is used only when the BPF scheduler is loaded and running.

If a task explicitly sets its scheduling policy to ``SCHED_EXT``, it will be
treated as ``SCHED_NORMAL`` and scheduled by CFS until the BPF scheduler is
loaded.

When the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is not set
in ``ops->flags``, all ``SCHED_NORMAL``, ``SCHED_BATCH``, ``SCHED_IDLE``, and
``SCHED_EXT`` tasks are scheduled by sched_ext.

However, when the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is
set in ``ops->flags``, only tasks with the ``SCHED_EXT`` policy are scheduled
by sched_ext, while tasks with the ``SCHED_NORMAL``, ``SCHED_BATCH`` and
``SCHED_IDLE`` policies are scheduled by CFS.

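A minimal sketch of such a partial-mode scheduler is shown below. It assumes
the ``tools/sched_ext`` build environment, where ``scx/common.bpf.h`` pulls
in the needed definitions; the ``partial_ops`` name is made up for
illustration, and the flag can equally be set on the skeleton by the
userspace loader before attaching.

.. code-block:: c

    #include <scx/common.bpf.h>

    SEC(".struct_ops")
    struct sched_ext_ops partial_ops = {
            /*
             * Only tasks which explicitly set their policy to SCHED_EXT
             * are handed to this scheduler; everything else stays on CFS.
             * With no callbacks defined, the default ops behavior applies
             * to the tasks that do switch over.
             */
            .flags  = SCX_OPS_SWITCH_PARTIAL,
            .name   = "partial",
    };

    char _license[] SEC("license") = "GPL";
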
Terminating the sched_ext scheduler program, triggering ``SysRq-S``, or
detection of any internal error including stalled runnable tasks aborts the
BPF scheduler and reverts all tasks back to CFS.

.. code-block:: none

    # make -j16 -C tools/sched_ext
    # tools/sched_ext/build/bin/scx_simple
    local=0 global=3
    local=5 global=24
    local=9 global=44
    local=13 global=56
    local=17 global=72
    ^CEXIT: BPF scheduler unregistered

The current status of the BPF scheduler can be determined as follows:

.. code-block:: none

    # cat /sys/kernel/sched_ext/state
    enabled
    # cat /sys/kernel/sched_ext/root/ops
    simple

You can check if any BPF scheduler has ever been loaded since boot by
examining this monotonically incrementing counter (a value of zero indicates
that no BPF scheduler has been loaded):

.. code-block:: none

    # cat /sys/kernel/sched_ext/enable_seq
    1

``tools/sched_ext/scx_show_state.py`` is a drgn script which shows more
detailed information:

.. code-block:: none

    # tools/sched_ext/scx_show_state.py
    ops           : simple
    enabled       : 1
    switching_all : 1
    switched_all  : 1
    enable_state  : enabled (2)
    bypass_depth  : 0
    nr_rejected   : 0
    enable_seq    : 1

Whether a given task is on sched_ext can be determined as follows:

.. code-block:: none

    # grep ext /proc/self/sched
    ext.enabled : 1

The Basics
==========

Userspace can implement an arbitrary BPF scheduler by loading a set of BPF
programs that implement ``struct sched_ext_ops``. The only mandatory field
is ``ops.name``, which must be a valid BPF object name. All operations are
optional. The following modified excerpt is from
``tools/sched_ext/scx_simple.bpf.c`` showing a minimal global FIFO scheduler.

.. code-block:: c

    /*
     * Decide which CPU a task should be migrated to before being
     * enqueued (either at wakeup, fork time, or exec time). If an
     * idle core is found by the default ops.select_cpu() implementation,
     * then insert the task directly into SCX_DSQ_LOCAL and skip the
     * ops.enqueue() callback.
     *
     * Note that this implementation has exactly the same behavior as the
     * default ops.select_cpu implementation. The behavior of the scheduler
     * would be exactly the same if the implementation just didn't define
     * the simple_select_cpu() struct_ops prog.
     */
    s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p,
                       s32 prev_cpu, u64 wake_flags)
    {
            s32 cpu;
            /* Need to initialize or the BPF verifier will reject the program */
            bool direct = false;

            cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &direct);

            if (direct)
                    scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

            return cpu;
    }

    /*
     * Do a direct insertion of a task to the global DSQ. This ops.enqueue()
     * callback will only be invoked if we failed to find a core to insert
     * into in ops.select_cpu() above.
     *
     * Note that this implementation has exactly the same behavior as the
     * default ops.enqueue implementation, which just dispatches the task
     * to SCX_DSQ_GLOBAL. The behavior of the scheduler would be exactly
     * the same if the implementation just didn't define the simple_enqueue
     * struct_ops prog.
     */
    void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
    {
            scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
    }

    s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
    {
            /*
             * By default, all SCHED_EXT, SCHED_OTHER, SCHED_IDLE, and
             * SCHED_BATCH tasks should use sched_ext.
             */
            return 0;
    }

    void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei)
    {
            /* exit_type is a global variable defined elsewhere in the file */
            exit_type = ei->type;
    }

    SEC(".struct_ops")
    struct sched_ext_ops simple_ops = {
            .select_cpu     = (void *)simple_select_cpu,
            .enqueue        = (void *)simple_enqueue,
            .init           = (void *)simple_init,
            .exit           = (void *)simple_exit,
            .name           = "simple",
    };

Dispatch Queues
---------------

To match the impedance between the scheduler core and the BPF scheduler,
sched_ext uses DSQs (dispatch queues), which can operate as both a FIFO and
a priority queue. By default, there is one global FIFO (``SCX_DSQ_GLOBAL``)
and one local DSQ per CPU (``SCX_DSQ_LOCAL``). The BPF scheduler can manage
an arbitrary number of DSQs using ``scx_bpf_create_dsq()`` and
``scx_bpf_destroy_dsq()``.

A CPU always executes a task from its local DSQ. A task is "inserted" into a
DSQ. A task in a non-local DSQ is "moved" into the target CPU's local DSQ.

When a CPU is looking for the next task to run, if the local DSQ is not
empty, the first task is picked. Otherwise, the CPU tries to move a task
from the global DSQ. If that doesn't yield a runnable task either,
``ops.dispatch()`` is invoked.

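The following sketch shows the DSQ API end to end: create one shared custom
DSQ at load time, queue every runnable task on it in ``ops.enqueue()``, and
move tasks from it to the dispatching CPU's local DSQ in ``ops.dispatch()``.
It closely follows what ``scx_simple`` does with its shared DSQ, but the
``MY_DSQ_ID`` constant and the ``myfifo_*`` names are made up for
illustration.

.. code-block:: c

    #include <scx/common.bpf.h>

    #define MY_DSQ_ID 0     /* custom DSQ IDs only need to be below 2^63 */

    s32 BPF_STRUCT_OPS_SLEEPABLE(myfifo_init)
    {
            /* Create the shared DSQ; -1 means no NUMA node preference */
            return scx_bpf_create_dsq(MY_DSQ_ID, -1);
    }

    void BPF_STRUCT_OPS(myfifo_enqueue, struct task_struct *p, u64 enq_flags)
    {
            /* Queue every runnable task on the shared DSQ in FIFO order */
            scx_bpf_dsq_insert(p, MY_DSQ_ID, SCX_SLICE_DFL, enq_flags);
    }

    void BPF_STRUCT_OPS(myfifo_dispatch, s32 cpu, struct task_struct *prev)
    {
            /* Move the first queued task to this CPU's local DSQ */
            scx_bpf_dsq_move_to_local(MY_DSQ_ID);
    }

    SEC(".struct_ops")
    struct sched_ext_ops myfifo_ops = {
            .enqueue        = (void *)myfifo_enqueue,
            .dispatch       = (void *)myfifo_dispatch,
            .init           = (void *)myfifo_init,
            .name           = "myfifo",
    };

    char _license[] SEC("license") = "GPL";

Returning the value of ``scx_bpf_create_dsq()`` from ``ops.init()`` makes a
failed DSQ creation abort loading the scheduler.
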
Scheduling Cycle
----------------

The following briefly describes how a waking task is scheduled and executed.

1. When a task is waking up, ``ops.select_cpu()`` is the first operation
   invoked. This serves two purposes. First, it provides a CPU selection
   optimization hint. Second, it wakes up the selected CPU if it is idle.

   The CPU selected by ``ops.select_cpu()`` is an optimization hint and not
   binding. The actual decision is made at the last step of scheduling.
   However, there is a small performance gain if the CPU
   ``ops.select_cpu()`` returns matches the CPU the task eventually runs on.

   A side-effect of selecting a CPU is waking it up from idle. While a BPF
   scheduler can wake up any CPU using the ``scx_bpf_kick_cpu()`` helper,
   using ``ops.select_cpu()`` judiciously can be simpler and more efficient.

   A task can be immediately inserted into a DSQ from ``ops.select_cpu()``
   by calling ``scx_bpf_dsq_insert()``. If the task is inserted into
   ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be inserted into the
   local DSQ of whichever CPU is returned from ``ops.select_cpu()``.
   Additionally, inserting directly from ``ops.select_cpu()`` will cause the
   ``ops.enqueue()`` callback to be skipped.

   Note that the scheduler core will ignore an invalid CPU selection, for
   example, if it's outside the allowed cpumask of the task.

2. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the
   task was inserted directly from ``ops.select_cpu()``). ``ops.enqueue()``
   can make one of the following decisions:

   * Immediately insert the task into either the global or a local DSQ by
     calling ``scx_bpf_dsq_insert()`` with one of the following targets:
     ``SCX_DSQ_GLOBAL``, ``SCX_DSQ_LOCAL``, or ``SCX_DSQ_LOCAL_ON | cpu``.

   * Immediately insert the task into a custom DSQ by calling
     ``scx_bpf_dsq_insert()`` with a DSQ ID which is smaller than 2^63.

   * Queue the task on the BPF side.

3. When a CPU is ready to schedule, it first looks at its local DSQ. If
   empty, it then looks at the global DSQ. If there still isn't a task to
   run, ``ops.dispatch()`` is invoked, which can use the following two
   functions to populate the local DSQ.

   * ``scx_bpf_dsq_insert()`` inserts a task into a DSQ. Any target DSQ can
     be used - ``SCX_DSQ_LOCAL``, ``SCX_DSQ_LOCAL_ON | cpu``,
     ``SCX_DSQ_GLOBAL`` or a custom DSQ. While ``scx_bpf_dsq_insert()``
     currently can't be called with BPF locks held, this is being worked on
     and will be supported. ``scx_bpf_dsq_insert()`` schedules insertions
     rather than performing them immediately. There can be up to
     ``ops.dispatch_max_batch`` pending tasks.

   * ``scx_bpf_dsq_move_to_local()`` moves a task from the specified
     non-local DSQ to the dispatching CPU's local DSQ. This function cannot
     be called with any BPF locks held. ``scx_bpf_dsq_move_to_local()``
     flushes the pending insertions before trying to move a task from the
     specified DSQ.

4. After ``ops.dispatch()`` returns, if there are tasks in the local DSQ,
   the CPU runs the first one. If the local DSQ is still empty, the
   following steps are taken:

   * Try to move a task from the global DSQ. If successful, run the task.

   * If ``ops.dispatch()`` has dispatched any tasks, retry step 3.

   * If the previous task is an SCX task and still runnable, keep executing
     it (see ``SCX_OPS_ENQ_LAST``).

   * Go idle.

Note that the BPF scheduler can always choose to dispatch tasks immediately
in ``ops.enqueue()`` as illustrated in the above simple example. If only the
built-in DSQs are used, there is no need to implement ``ops.dispatch()`` as
a task is never queued on the BPF scheduler and both the local and global
DSQs are consumed automatically.

``scx_bpf_dsq_insert()`` inserts the task on the FIFO of the target DSQ. Use
``scx_bpf_dsq_insert_vtime()`` for the priority queue. Internal DSQs such as
``SCX_DSQ_LOCAL`` and ``SCX_DSQ_GLOBAL`` do not support priority-queue
dispatching, and must be dispatched to with ``scx_bpf_dsq_insert()``. See
the function documentation and usage in ``tools/sched_ext/scx_simple.bpf.c``
for more information.

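For instance, a scheduler can keep a custom DSQ ordered by virtual time
rather than FIFO order. The sketch below is loosely modeled on the
weighted-vtime mode of ``scx_simple``; the ``vtime_*`` names and
``MY_DSQ_ID`` are made up, and a complete scheduler would also clamp
``dsq_vtime`` on enqueue and advance a global vtime in ``ops.running()`` as
``scx_simple`` does.

.. code-block:: c

    #include <scx/common.bpf.h>

    #define MY_DSQ_ID 0

    /* Insert in virtual-time order instead of FIFO order */
    void BPF_STRUCT_OPS(vtime_enqueue, struct task_struct *p, u64 enq_flags)
    {
            scx_bpf_dsq_insert_vtime(p, MY_DSQ_ID, SCX_SLICE_DFL,
                                     p->scx.dsq_vtime, enq_flags);
    }

    /*
     * Charge the consumed part of the slice, scaled inversely by weight,
     * so that heavier tasks accumulate virtual time more slowly and thus
     * receive a larger share of the CPU.
     */
    void BPF_STRUCT_OPS(vtime_stopping, struct task_struct *p, bool runnable)
    {
            p->scx.dsq_vtime +=
                    (SCX_SLICE_DFL - p->scx.slice) * 100 / p->scx.weight;
    }

These callbacks would be wired into the ops table as ``.enqueue`` and
``.stopping``, and the custom DSQ would be consumed from ``ops.dispatch()``
exactly as in the previous sketch.
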
Task Lifecycle
--------------

The following pseudo-code summarizes the entire lifecycle of a task managed
by a sched_ext scheduler:

.. code-block:: c

    ops.init_task();             /* A new task is created */
    ops.enable();                /* Enable BPF scheduling for the task */

    while (task in SCHED_EXT) {
            if (task can migrate)
                    ops.select_cpu();    /* Called on wakeup (optimization) */

            ops.runnable();              /* Task becomes ready to run */

            while (task is runnable) {
                    if (task is not in a DSQ) {
                            ops.enqueue();       /* Task can be added to a DSQ */

                            /* A CPU becomes available */

                            ops.dispatch();      /* Task is moved to a local DSQ */
                    }
                    ops.running();       /* Task starts running on its assigned CPU */
                    ops.tick();          /* Called every 1/HZ seconds */
                    ops.stopping();      /* Task stops running (time slice expires or wait) */
            }

            ops.quiescent();             /* Task releases its assigned CPU (wait) */
    }

    ops.disable();               /* Disable BPF scheduling for the task */
    ops.exit_task();             /* Task is destroyed */

Where to Look
=============

* ``include/linux/sched/ext.h`` defines the core data structures, ops table
  and constants.

* ``kernel/sched/ext.c`` contains the sched_ext core implementation and
  helpers. The functions prefixed with ``scx_bpf_`` can be called from the
  BPF scheduler.

* ``tools/sched_ext/`` hosts example BPF scheduler implementations.

  * ``scx_simple[.bpf].c``: Minimal global FIFO scheduler example using a
    custom DSQ.

  * ``scx_qmap[.bpf].c``: A multi-level FIFO scheduler supporting five
    levels of priority implemented with ``BPF_MAP_TYPE_QUEUE``.

ABI Instability
===============

The APIs provided by sched_ext to BPF scheduler programs have no stability
guarantees. This includes the ops table callbacks and constants defined in
``include/linux/sched/ext.h``, as well as the ``scx_bpf_`` kfuncs defined in
``kernel/sched/ext.c``.

While we will attempt to provide a relatively stable API surface when
possible, these interfaces are subject to change without warning between
kernel versions.