.. SPDX-License-Identifier: GPL-2.0

=============
CPU Isolation
=============

Introduction
============

"CPU isolation" means leaving a CPU exclusive to a given workload,
without any undesired code interference from the kernel.

This interference, commonly referred to as "noise", can be triggered
by asynchronous events (interrupts, timers, scheduler preemption by
workqueues and kthreads, ...) or synchronous events (syscalls and page
faults).

Such noise usually goes unnoticed. After all, synchronous events are a
component of the requested kernel service, and asynchronous events are
either sufficiently well distributed by the scheduler when executed
as tasks or reasonably fast when executed as interrupts. The timer
interrupt can even fire 1024 times per second without a significant
and measurable impact most of the time.

However some rare and extreme workloads can be quite sensitive to
those kinds of noise. This is the case, for example, with high
bandwidth network processing that can't afford to lose a single
packet, or with very low latency network processing. Typically those
use cases involve DPDK, bypassing the kernel networking stack and
accessing the networking device directly from userspace.

In order to run a CPU with no, or limited, kernel noise, the
related housekeeping work needs to be either shut down, migrated or
offloaded.

Housekeeping
============

In the CPU isolation terminology, housekeeping is the work, often
asynchronous, that the kernel needs to process in order to maintain
all its services. It covers the noise and disturbances enumerated
above, except when at least one CPU is isolated; housekeeping may then
make use of further coping mechanisms if CPU-tied work must be
offloaded.

Housekeeping CPUs are the non-isolated CPUs to which the kernel noise
is moved away from the isolated CPUs.
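Which CPUs were isolated at boot can be inspected from sysfs. A
minimal sketch, assuming a modern kernel that exposes these read-only
files (they read empty when the corresponding feature is unused):

```shell
# Print the boot-time isolated and nohz_full CPU lists, if exposed.
# Both files read empty when no CPU uses the corresponding feature.
summary=""
for f in /sys/devices/system/cpu/isolated /sys/devices/system/cpu/nohz_full; do
        if [ -r "$f" ]; then
                summary="$summary$f: $(cat "$f")
"
        else
                summary="$summary$f: not exposed by this kernel
"
        fi
done
printf '%s' "$summary"
```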
The isolation can be implemented in several ways depending on the
nature of the noise:

- Unbound work, where "unbound" means not tied to any CPU, can
  simply be migrated away from isolated CPUs to housekeeping CPUs.
  This is the case for unbound workqueues, kthreads and timers.

- Bound work, where "bound" means tied to a specific CPU, usually
  can't be moved away as-is by nature. Either:

  - The work must switch to a locked implementation. E.g.:
    this is the case of RCU with CONFIG_RCU_NOCB_CPU.

  - The related feature must be shut down and considered
    incompatible with isolated CPUs. E.g.: lockup watchdog,
    unreliable clocksources, etc...

  - An elaborate and heavyweight coping mechanism stands as a
    replacement. E.g.: the timer tick is shut down on nohz_full
    CPUs, but with the constraint of running a single task on
    them. A significant cost penalty is added on kernel entry/exit
    and a residual 1Hz scheduler tick is offloaded to housekeeping
    CPUs.

In any case, housekeeping work has to be handled, which is why there
must be at least one housekeeping CPU in the system, preferably more
if the machine has a lot of CPUs: for example one per node on NUMA
systems.

Also CPU isolation often means a tradeoff between noise-free isolated
CPUs and added overhead on housekeeping CPUs, sometimes even on
isolated CPUs entering the kernel.

Isolation features
==================

Different levels of isolation can be configured in the kernel, each of
which has its own drawbacks and tradeoffs.

Scheduler domain isolation
--------------------------

This feature isolates a CPU from the scheduler topology. As a result,
the target isn't part of the load balancing. Tasks won't migrate
either from or to it unless explicitly affined.

As a side effect the CPU is also isolated from unbound workqueues and
unbound kthreads.
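Explicitly affining a task to a CPU, as mentioned above, can be done
with taskset(1) from util-linux. A minimal sketch (CPU 0 is used so
the snippet runs anywhere; an isolated setup would target e.g. CPU 7):

```shell
# Run a command explicitly affined to a chosen CPU.
# CPU 0 keeps the sketch runnable on any machine; substitute the
# isolated CPU number in a real setup.
if command -v taskset >/dev/null 2>&1; then
        result=$(taskset -c 0 sh -c 'echo affined')
else
        result=skipped  # taskset (util-linux) not installed
fi
echo "$result"
```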
Requirements
~~~~~~~~~~~~

- CONFIG_CPUSETS=y for the cpusets-based interface

Tradeoffs
~~~~~~~~~

By nature, the overall system load is less well distributed since some
CPUs are extracted from the global load balancing.

Interfaces
~~~~~~~~~~

- Cpuset isolated partitions, as described in
  Documentation/admin-guide/cgroup-v2.rst, are recommended because
  they are tunable at runtime.

- The "isolcpus=" kernel boot parameter with the "domain" flag is a
  less flexible alternative that doesn't allow for runtime
  reconfiguration.

IRQs isolation
--------------

Isolate the IRQs whenever possible, so that they don't fire on the
target CPUs.

Interfaces
~~~~~~~~~~

- The /proc/irq/\*/smp_affinity files, as explained in detail in
  Documentation/core-api/irq/irq-affinity.rst.

- The "irqaffinity=" kernel boot parameter for a default setting.

- The "managed_irq" flag in the "isolcpus=" kernel boot parameter
  tries a best-effort affinity override for managed IRQs.

Full Dynticks (aka nohz_full)
-----------------------------

Full dynticks extends the dynticks idle mode, which stops the tick
when the CPU is idle, to CPUs running a single task in userspace. That
is, the timer tick is stopped if the environment allows it.

Global timer callbacks are also isolated from the nohz_full CPUs.

Requirements
~~~~~~~~~~~~

- CONFIG_NO_HZ_FULL=y

Constraints
~~~~~~~~~~~

- The isolated CPUs must run a single task only. Multitasking requires
  the tick to maintain preemption. This is usually fine since the
  workload usually can't stand the latency of random context switches.

- No calls to the kernel from isolated CPUs, at the risk of triggering
  random noise.

- No use of POSIX CPU timers on isolated CPUs.

- The architecture must have a stable and reliable clocksource (no
  unreliable TSC that requires the watchdog).
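The clocksource constraint above can be checked from sysfs. A minimal
sketch, assuming the standard clocksource sysfs layout (guarded in
case it differs):

```shell
# On x86, "tsc" as the current clocksource means the TSC was deemed
# reliable; anything else suggests a watchdog-enforced fallback that
# defeats full dynticks.
cs=/sys/devices/system/clocksource/clocksource0
if [ -r "$cs/current_clocksource" ]; then
        current=$(cat "$cs/current_clocksource")
else
        current=unknown  # sysfs layout differs or not a Linux system
fi
echo "current clocksource: $current"
```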
Tradeoffs
~~~~~~~~~

In terms of cost, this is the most invasive isolation feature. It is
assumed to be used when the workload spends most of its time in
userspace and doesn't rely on the kernel except for preparatory
work, because:

- RCU adds more overhead due to the locked, offloaded and threaded
  callback processing (the same as would be obtained with the
  "rcu_nocbs" boot parameter).

- Kernel entry/exit through syscalls, exceptions and IRQs is more
  costly due to fully ordered RmW operations that maintain userspace
  as an RCU extended quiescent state. Also the CPU time is accounted
  on kernel boundaries instead of periodically from the tick.

- Housekeeping CPUs must run a residual 1Hz remote scheduler tick
  on behalf of the isolated CPUs.

Checklist
=========

You have set up each of the above isolation features but you still
observe jitter that trashes your workload? Make sure to check a few
elements before proceeding.

Some of these checklist items are similar to those of real-time
workloads:

- Use mlock() to prevent your pages from being swapped out. Page
  faults are usually not compatible with jitter-sensitive workloads.

- Avoid SMT to prevent your hardware thread from being "preempted"
  by another one.

- CPU frequency changes may induce subtle sorts of jitter in a
  workload. Cpufreq should be used and tuned with caution.

- Deep C-states may result in latency issues upon wake-up. If this
  happens to be a problem, C-states can be limited via kernel boot
  parameters such as processor.max_cstate or intel_idle.max_cstate.
  More fine-grained tunings are described in
  Documentation/admin-guide/pm/cpuidle.rst.

- Your system may be subject to firmware-originating interrupts; x86
  has System Management Interrupts (SMIs) for example.
  Check your system
  BIOS for a way to disable such interference, and with some luck your
  vendor will have BIOS tuning guidance for low-latency operations.


Full isolation example
======================

In this example, the system has 8 CPUs and the 8th is to be fully
isolated. Since CPUs are numbered from 0, the 8th CPU is CPU 7.

Kernel parameters
-----------------

Set the following kernel boot parameters to disable SMT and set up
tick and IRQ isolation:

- Full dynticks: nohz_full=7

- IRQs isolation: irqaffinity=0-6

- Managed IRQs isolation: isolcpus=managed_irq,7

- Prevent SMT: nosmt

The full command line is then:

::

  nohz_full=7 irqaffinity=0-6 isolcpus=managed_irq,7 nosmt

CPUSET configuration (cgroup v2)
--------------------------------

Assuming cgroup v2 is mounted on /sys/fs/cgroup, the following script
isolates CPU 7 from scheduler domains.

::

  cd /sys/fs/cgroup
  # Activate the cpuset subsystem
  echo +cpuset > cgroup.subtree_control
  # Create the partition to be isolated
  mkdir test
  cd test
  echo +cpuset > cgroup.subtree_control
  # Isolate CPU 7
  echo 7 > cpuset.cpus
  echo "isolated" > cpuset.cpus.partition

The userspace workload
----------------------

To fake a pure userspace workload, the program below runs a dummy
userspace loop on the isolated CPU 7.
::

  #include <stdio.h>
  #include <fcntl.h>
  #include <unistd.h>

  int main(void)
  {
          // Move the current task to the isolated cpuset (bound to CPU 7)
          int fd = open("/sys/fs/cgroup/test/cgroup.procs", O_WRONLY);

          if (fd < 0) {
                  perror("Can't open cpuset file");
                  return 1;
          }

          // Writing "0" moves the writing task itself
          if (write(fd, "0\n", 2) < 0) {
                  perror("Can't write to cpuset file");
                  close(fd);
                  return 1;
          }
          close(fd);

          // Run an endless dummy loop until the launcher kills us
          while (1)
                  ;

          return 0;
  }

Build it and save it for a later step:

::

  # gcc user_loop.c -o user_loop

The launcher
------------

The launcher below runs the above program for 10 seconds and traces
the noise resulting from preempting tasks and IRQs:

::

  TRACING=/sys/kernel/tracing
  # Make sure tracing is off for now
  echo 0 > $TRACING/tracing_on
  # Flush previous traces
  echo > $TRACING/trace
  # Record disturbances from other tasks
  echo 1 > $TRACING/events/sched/sched_switch/enable
  # Record disturbances from interrupts
  echo 1 > $TRACING/events/irq_vectors/enable
  # Now we can start tracing
  echo 1 > $TRACING/tracing_on
  # Run the dummy user_loop for 10 seconds on CPU 7
  ./user_loop &
  USER_LOOP_PID=$!
  sleep 10
  kill $USER_LOOP_PID
  # Disable tracing and save the traces from CPU 7 to a file
  echo 0 > $TRACING/tracing_on
  cat $TRACING/per_cpu/cpu7/trace > trace.7

If no specific problem arose, the content of trace.7 should look like
the following:

::

  <idle>-0 [007] d..2. 1980.976624: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=user_loop next_pid=1553 next_prio=120
  user_loop-1553 [007] d.h.. 1990.946593: reschedule_entry: vector=253
  user_loop-1553 [007] d.h.. 1990.946593: reschedule_exit: vector=253

That is, no noise triggered between the initial context switch and the
final reschedule during the 10 seconds while user_loop was running.
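A quick way to quantify disturbances in such a per-CPU trace file is
to count the recorded events. A minimal sketch (the two-line trace is
fabricated here so the example is self-contained; a real run would use
the trace.7 file saved by the launcher):

```shell
# Fabricated stand-in for a real trace.7 saved by the launcher
cat > trace.7 <<'EOF'
user_loop-1553 [007] d.h.. 1990.946593: reschedule_entry: vector=253
user_loop-1553 [007] d.h.. 1990.946593: reschedule_exit: vector=253
EOF
# Count IRQ vector events recorded on the isolated CPU
events=$(grep -c 'vector=' trace.7)
echo "$events IRQ vector events"
```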
Debugging
=========

Of course things are never so easy, especially on this matter.
Chances are that actual noise will be observed in the aforementioned
trace.7 file.

The best way to investigate further is to enable finer-grained
tracepoints, such as those of the subsystems producing asynchronous
events: workqueue, timer, irq_vectors, etc... It can also be
interesting to enable the tick_stop event to diagnose why the tick is
retained when that happens.

Some tools may also be useful for higher level analysis:

- Documentation/tools/rtla/rtla.rst provides a suite of tools to
  analyze latency and noise in the system. For example
  Documentation/tools/rtla/rtla-osnoise.rst runs a kernel tracer that
  analyzes and outputs a summary of the noise.

- dynticks-testing does something similar to rtla-osnoise but in
  userspace. It is available at
  git://git.kernel.org/pub/scm/linux/kernel/git/frederic/dynticks-testing.git