.. SPDX-License-Identifier: GPL-2.0

===========================
Affinity managed interrupts
===========================

The IRQ core provides support for managing interrupts according to a specified
CPU affinity. Under normal operation, an interrupt is associated with a
particular CPU. If that CPU is taken offline, the interrupt is migrated to
another online CPU.

Devices with large numbers of interrupt vectors can stress the available vector
space. For example, an NVMe device with 128 I/O queues typically requests one
interrupt per queue on systems with at least 128 CPUs. Two such devices
therefore request 256 interrupts. On x86, the interrupt vector space is
limited: each CPU provides only 256 vectors, and the kernel reserves a number
of them for its own use, further reducing the number available for device
interrupts. In practice this is not a problem because the interrupts are
distributed across many CPUs, so each CPU receives only a small number of
vectors.

During system suspend, however, all secondary CPUs are taken offline and all
interrupts are migrated to the single CPU that remains online. This can exhaust
the available interrupt vectors on that CPU and cause the suspend operation to
fail.

Affinity-managed interrupts address this limitation. Each interrupt is assigned
a CPU affinity mask that specifies the set of CPUs on which the interrupt may
be targeted. When a CPU in the mask goes offline, the interrupt is moved to the
next CPU in the mask. If the last CPU in the mask goes offline, the interrupt
is shut down. Drivers using affinity-managed interrupts must ensure that the
associated queue is quiesced before the interrupt is disabled so that no
further interrupts are generated. When a CPU in the affinity mask comes back
online, the interrupt is re-enabled.

Implementation
--------------

Devices must provide per-instance interrupts, such as per-I/O-queue interrupts
for storage devices like NVMe. The driver allocates interrupt vectors with the
required affinity settings using struct irq_affinity. For MSI-X devices, this
is done via pci_alloc_irq_vectors_affinity() with the PCI_IRQ_AFFINITY flag
set.

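As a rough sketch of that allocation step (foo_setup_irqs() and FOO_NR_MGMT
are hypothetical names invented for illustration; the kernel interfaces called
are the real ones named above, but this fragment is not a complete or
buildable driver):

```c
#include <linux/interrupt.h>
#include <linux/pci.h>

#define FOO_NR_MGMT 1	/* management vectors, excluded from spreading */

static int foo_setup_irqs(struct pci_dev *pdev, unsigned int nr_queues)
{
	struct irq_affinity affd = {
		/* The first vector is not affinity managed. */
		.pre_vectors = FOO_NR_MGMT,
	};
	int nvecs;

	nvecs = pci_alloc_irq_vectors_affinity(pdev, FOO_NR_MGMT + 1,
					       FOO_NR_MGMT + nr_queues,
					       PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
					       &affd);
	if (nvecs < 0)
		return nvecs;

	/* request_irq() for each vector follows. */
	return nvecs;
}
```
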
Based on the provided affinity information, the IRQ core attempts to spread the
interrupts evenly across the system. The affinity masks are computed during
this allocation step, but the final IRQ assignment is performed when
request_irq() is invoked.

Isolated CPUs
-------------

The affinity of managed interrupts is handled entirely in the kernel and cannot
be modified from user space through the /proc interfaces. The managed_irq
sub-parameter of the isolcpus boot option specifies a CPU mask that managed
interrupts should attempt to avoid. This isolation is best-effort and only
applies if the automatically assigned interrupt mask also contains online CPUs
outside the avoided mask. If the requested mask contains only isolated CPUs,
the setting has no effect.

CPUs listed in the avoided mask remain part of the interrupt's affinity mask.
This means that if all non-isolated CPUs go offline while isolated CPUs remain
online, the interrupt will be assigned to one of the isolated CPUs.

The following examples assume a system with 8 CPUs.

- A QEMU instance is booted with "-device virtio-scsi-pci".
  The MSI-X device exposes 11 interrupts: 3 "management" interrupts and 8
  "queue" interrupts. The driver requests the 8 queue interrupts, each of which
  is affine to exactly one CPU. If that CPU goes offline, the interrupt is shut
  down.

  Assuming interrupt 48 is one of the queue interrupts, the following appears::

    /proc/irq/48/effective_affinity_list:7
    /proc/irq/48/smp_affinity_list:7

  This indicates that the interrupt is served only by CPU7. Shutting down CPU7
  does not migrate the interrupt to another CPU::

    /proc/irq/48/effective_affinity_list:0
    /proc/irq/48/smp_affinity_list:7

  This can be verified via the debugfs interface
  (/sys/kernel/debug/irq/irqs/48). The dstate field will include
  IRQD_IRQ_DISABLED, IRQD_IRQ_MASKED and IRQD_MANAGED_SHUTDOWN.

- A QEMU instance is booted with "-device virtio-scsi-pci,num_queues=2"
  and the kernel command line includes:
  "irqaffinity=0,1 isolcpus=domain,2-7 isolcpus=managed_irq,1-3,5-7".
  The MSI-X device exposes 5 interrupts: 3 management interrupts and 2 queue
  interrupts. The management interrupts follow the irqaffinity= setting. The
  queue interrupts are spread across available CPUs::

    /proc/irq/47/effective_affinity_list:0
    /proc/irq/47/smp_affinity_list:0-3
    /proc/irq/48/effective_affinity_list:4
    /proc/irq/48/smp_affinity_list:4-7

  The two queue interrupts are evenly distributed. Interrupt 48 is placed on
  CPU4 because the managed_irq mask avoids CPUs 5-7 when possible.

  Replacing the managed_irq argument with "isolcpus=managed_irq,1-3,4-5,7"
  results in::

    /proc/irq/48/effective_affinity_list:6
    /proc/irq/48/smp_affinity_list:4-7

  Interrupt 48 is now served on CPU6 because the system avoids CPUs 4, 5 and
  7. If CPU6 is taken offline, the interrupt migrates to one of the "isolated"
  CPUs::

    /proc/irq/48/effective_affinity_list:7
    /proc/irq/48/smp_affinity_list:4-7

  The interrupt is shut down once all CPUs listed in its smp_affinity mask are
  offline.