.. SPDX-License-Identifier: GPL-2.0

=============
CPU Isolation
=============

Introduction
============

"CPU isolation" means leaving a CPU exclusive to a given workload,
without any undesired code interference from the kernel.

This interference, commonly referred to as "noise", can be triggered
by asynchronous events (interrupts, timers, scheduler preemption by
workqueues and kthreads, ...) or synchronous events (syscalls and page
faults).

Such noise usually goes unnoticed. After all, synchronous events are a
component of the requested kernel service, and asynchronous events are
either sufficiently well distributed by the scheduler when executed
as tasks or reasonably fast when executed as interrupts. The timer
interrupt can even fire 1024 times per second without a significant
and measurable impact most of the time.

However some rare and extreme workloads can be quite sensitive to
those kinds of noise. This is the case, for example, with high
bandwidth network processing that can't afford to lose a single
packet, or with very low latency network processing. Typically those
use cases involve DPDK, bypassing the kernel networking stack and
accessing the networking device directly from userspace.

In order to run a CPU with no, or limited, kernel noise, the
related housekeeping work needs to be either shut down, migrated or
offloaded.

Housekeeping
============

In the CPU isolation terminology, housekeeping is the work, often
asynchronous, that the kernel needs to process in order to maintain
all its services. It covers the noise and disturbances enumerated
above, except when at least one CPU is isolated; housekeeping may then
make use of further coping mechanisms if CPU-tied work must be
offloaded.

Housekeeping CPUs are the non-isolated CPUs to which the kernel noise
is moved away from the isolated CPUs.
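Which CPUs were isolated at boot can be inspected from sysfs. A
minimal sketch, assuming a modern kernel that exposes these read-only
files (they read empty when the corresponding feature is unused):

```shell
# Print the boot-time isolated and nohz_full CPU lists, if exposed.
# Both files read empty when no CPU uses the corresponding feature.
summary=""
for f in /sys/devices/system/cpu/isolated /sys/devices/system/cpu/nohz_full; do
        if [ -r "$f" ]; then
                summary="$summary$f: $(cat "$f")
"
        else
                summary="$summary$f: not exposed by this kernel
"
        fi
done
printf '%s' "$summary"
```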
The isolation can be implemented in several ways depending on the
nature of the noise:

- Unbound work, where "unbound" means not tied to any CPU, can
  simply be migrated away from isolated CPUs to housekeeping CPUs.
  This is the case for unbound workqueues, kthreads and timers.

- Bound work, where "bound" means tied to a specific CPU, usually
  can't be moved away as-is by nature. Either:

  - The work must switch to a locked implementation. E.g.:
    this is the case of RCU with CONFIG_RCU_NOCB_CPU.

  - The related feature must be shut down and considered
    incompatible with isolated CPUs. E.g.: lockup watchdog,
    unreliable clocksources, etc...

  - An elaborate and heavyweight coping mechanism stands as a
    replacement. E.g.: the timer tick is shut down on nohz_full
    CPUs, but with the constraint of running a single task on
    them. A significant cost penalty is added on kernel entry/exit
    and a residual 1Hz scheduler tick is offloaded to housekeeping
    CPUs.

In any case, housekeeping work has to be handled, which is why there
must be at least one housekeeping CPU in the system, preferably more
if the machine has a lot of CPUs: for example one per node on NUMA
systems.

Also CPU isolation often means a tradeoff between noise-free isolated
CPUs and added overhead on housekeeping CPUs, sometimes even on
isolated CPUs entering the kernel.

Isolation features
==================

Different levels of isolation can be configured in the kernel, each of
which has its own drawbacks and tradeoffs.

Scheduler domain isolation
--------------------------

This feature isolates a CPU from the scheduler topology. As a result,
the target isn't part of the load balancing. Tasks won't migrate
either from or to it unless explicitly affined.

As a side effect the CPU is also isolated from unbound workqueues and
unbound kthreads.
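Explicitly affining a task to a CPU, as mentioned above, can be done
with taskset(1) from util-linux. A minimal sketch (CPU 0 is used so
the snippet runs anywhere; an isolated setup would target e.g. CPU 7):

```shell
# Run a command explicitly affined to a chosen CPU.
# CPU 0 keeps the sketch runnable on any machine; substitute the
# isolated CPU number in a real setup.
if command -v taskset >/dev/null 2>&1; then
        result=$(taskset -c 0 sh -c 'echo affined')
else
        result=skipped  # taskset (util-linux) not installed
fi
echo "$result"
```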
Requirements
~~~~~~~~~~~~

- CONFIG_CPUSETS=y for the cpusets-based interface

Tradeoffs
~~~~~~~~~

By nature, the overall system load is less well distributed since some
CPUs are extracted from the global load balancing.

Interfaces
~~~~~~~~~~

- Cpuset isolated partitions, as described in
  Documentation/admin-guide/cgroup-v2.rst, are recommended because
  they are tunable at runtime.

- The "isolcpus=" kernel boot parameter with the "domain" flag is a
  less flexible alternative that doesn't allow for runtime
  reconfiguration.

IRQs isolation
--------------

Isolate the IRQs whenever possible, so that they don't fire on the
target CPUs.

Interfaces
~~~~~~~~~~

- The /proc/irq/\*/smp_affinity files, as explained in detail in
  Documentation/core-api/irq/irq-affinity.rst.

- The "irqaffinity=" kernel boot parameter for a default setting.

- The "managed_irq" flag in the "isolcpus=" kernel boot parameter
  tries a best-effort affinity override for managed IRQs.

Full Dynticks (aka nohz_full)
-----------------------------

Full dynticks extends the dynticks idle mode, which stops the tick
when the CPU is idle, to CPUs running a single task in userspace. That
is, the timer tick is stopped if the environment allows it.

Global timer callbacks are also isolated from the nohz_full CPUs.

Requirements
~~~~~~~~~~~~

- CONFIG_NO_HZ_FULL=y

Constraints
~~~~~~~~~~~

- The isolated CPUs must run a single task only. Multitasking requires
  the tick to maintain preemption. This is usually fine since the
  workload usually can't stand the latency of random context switches.

- No calls to the kernel from isolated CPUs, at the risk of triggering
  random noise.

- No use of POSIX CPU timers on isolated CPUs.

- The architecture must have a stable and reliable clocksource (no
  unreliable TSC that requires the watchdog).
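The clocksource constraint above can be checked from sysfs. A minimal
sketch, assuming the standard clocksource sysfs layout (guarded in
case it differs):

```shell
# On x86, "tsc" as the current clocksource means the TSC was deemed
# reliable; anything else suggests a watchdog-enforced fallback that
# defeats full dynticks.
cs=/sys/devices/system/clocksource/clocksource0
if [ -r "$cs/current_clocksource" ]; then
        current=$(cat "$cs/current_clocksource")
else
        current=unknown  # sysfs layout differs or not a Linux system
fi
echo "current clocksource: $current"
```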
Tradeoffs
~~~~~~~~~

In terms of cost, this is the most invasive isolation feature. It is
assumed to be used when the workload spends most of its time in
userspace and doesn't rely on the kernel except for preparatory
work, because:

- RCU adds more overhead due to the locked, offloaded and threaded
  callback processing (the same as would be obtained with the
  "rcu_nocbs" boot parameter).

- Kernel entry/exit through syscalls, exceptions and IRQs is more
  costly due to fully ordered RmW operations that maintain userspace
  as an RCU extended quiescent state. Also the CPU time is accounted
  on kernel boundaries instead of periodically from the tick.

- Housekeeping CPUs must run a residual 1Hz remote scheduler tick
  on behalf of the isolated CPUs.

Checklist
=========

You have set up each of the above isolation features but you still
observe jitter that trashes your workload? Make sure to check a few
elements before proceeding.

Some of these checklist items are similar to those of real-time
workloads:

- Use mlock() to prevent your pages from being swapped out. Page
  faults are usually not compatible with jitter-sensitive workloads.

- Avoid SMT to prevent your hardware thread from being "preempted"
  by another one.

- CPU frequency changes may induce subtle sorts of jitter in a
  workload. Cpufreq should be used and tuned with caution.

- Deep C-states may result in latency issues upon wake-up. If this
  happens to be a problem, C-states can be limited via kernel boot
  parameters such as processor.max_cstate or intel_idle.max_cstate.
  More fine-grained tunings are described in
  Documentation/admin-guide/pm/cpuidle.rst.

- Your system may be subject to firmware-originating interrupts; x86
  has System Management Interrupts (SMIs) for example.
  Check your system
  BIOS for a way to disable such interference, and with some luck your
  vendor will have BIOS tuning guidance for low-latency operations.


Full isolation example
======================

In this example, the system has 8 CPUs and the 8th is to be fully
isolated. Since CPUs are numbered from 0, the 8th CPU is CPU 7.

Kernel parameters
-----------------

Set the following kernel boot parameters to disable SMT and set up
tick and IRQ isolation:

- Full dynticks: nohz_full=7

- IRQs isolation: irqaffinity=0-6

- Managed IRQs isolation: isolcpus=managed_irq,7

- Prevent SMT: nosmt

The full command line is then:

::

  nohz_full=7 irqaffinity=0-6 isolcpus=managed_irq,7 nosmt

CPUSET configuration (cgroup v2)
--------------------------------

Assuming cgroup v2 is mounted on /sys/fs/cgroup, the following script
isolates CPU 7 from scheduler domains.

::

  cd /sys/fs/cgroup
  # Activate the cpuset subsystem
  echo +cpuset > cgroup.subtree_control
  # Create the partition to be isolated
  mkdir test
  cd test
  echo +cpuset > cgroup.subtree_control
  # Isolate CPU 7
  echo 7 > cpuset.cpus
  echo "isolated" > cpuset.cpus.partition

The userspace workload
----------------------

To fake a pure userspace workload, the program below runs a dummy
userspace loop on the isolated CPU 7.
::

  #include <stdio.h>
  #include <fcntl.h>
  #include <unistd.h>

  int main(void)
  {
          // Move the current task to the isolated cpuset (bound to CPU 7)
          int fd = open("/sys/fs/cgroup/test/cgroup.procs", O_WRONLY);

          if (fd < 0) {
                  perror("Can't open cpuset file");
                  return 1;
          }

          // Writing "0" moves the writing task itself
          if (write(fd, "0\n", 2) < 0) {
                  perror("Can't write to cpuset file");
                  close(fd);
                  return 1;
          }
          close(fd);

          // Run an endless dummy loop until the launcher kills us
          while (1)
                  ;

          return 0;
  }

Build it and save it for a later step:

::

  # gcc user_loop.c -o user_loop

The launcher
------------

The launcher below runs the above program for 10 seconds and traces
the noise resulting from preempting tasks and IRQs:

::

  TRACING=/sys/kernel/tracing
  # Make sure tracing is off for now
  echo 0 > $TRACING/tracing_on
  # Flush previous traces
  echo > $TRACING/trace
  # Record disturbances from other tasks
  echo 1 > $TRACING/events/sched/sched_switch/enable
  # Record disturbances from interrupts
  echo 1 > $TRACING/events/irq_vectors/enable
  # Now we can start tracing
  echo 1 > $TRACING/tracing_on
  # Run the dummy user_loop for 10 seconds on CPU 7
  ./user_loop &
  USER_LOOP_PID=$!
  sleep 10
  kill $USER_LOOP_PID
  # Disable tracing and save the traces from CPU 7 to a file
  echo 0 > $TRACING/tracing_on
  cat $TRACING/per_cpu/cpu7/trace > trace.7

If no specific problem arose, the content of trace.7 should look like
the following:

::

  <idle>-0 [007] d..2. 1980.976624: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=user_loop next_pid=1553 next_prio=120
  user_loop-1553 [007] d.h.. 1990.946593: reschedule_entry: vector=253
  user_loop-1553 [007] d.h.. 1990.946593: reschedule_exit: vector=253

That is, no noise triggered between the initial context switch and the
final reschedule during the 10 seconds while user_loop was running.
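A quick way to quantify disturbances in such a per-CPU trace file is
to count the recorded events. A minimal sketch (the two-line trace is
fabricated here so the example is self-contained; a real run would use
the trace.7 file saved by the launcher):

```shell
# Fabricated stand-in for a real trace.7 saved by the launcher
cat > trace.7 <<'EOF'
user_loop-1553 [007] d.h.. 1990.946593: reschedule_entry: vector=253
user_loop-1553 [007] d.h.. 1990.946593: reschedule_exit: vector=253
EOF
# Count IRQ vector events recorded on the isolated CPU
events=$(grep -c 'vector=' trace.7)
echo "$events IRQ vector events"
```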
Debugging
=========

Of course things are never so easy, especially on this matter.
Chances are that actual noise will be observed in the aforementioned
trace.7 file.

The best way to investigate further is to enable finer-grained
tracepoints, such as those of the subsystems producing asynchronous
events: workqueue, timer, irq_vectors, etc... It can also be
interesting to enable the tick_stop event to diagnose why the tick is
retained when that happens.

Some tools may also be useful for higher level analysis:

- Documentation/tools/rtla/rtla.rst provides a suite of tools to
  analyze latency and noise in the system. For example
  Documentation/tools/rtla/rtla-osnoise.rst runs a kernel tracer that
  analyzes and outputs a summary of the noise.

- dynticks-testing does something similar to rtla-osnoise but in
  userspace. It is available at
  git://git.kernel.org/pub/scm/linux/kernel/git/frederic/dynticks-testing.git