
NAPI is the event handling mechanism used by the Linux networking stack.
The name NAPI no longer stands for anything in particular [#]_.

In basic operation the device notifies the host about new events
via an interrupt. The host then schedules a NAPI instance to process
the events. The device may also be polled for events via NAPI without
receiving interrupts first (busy polling, described below).

NAPI processing usually happens in the software interrupt context,
but there is an option to use separate kernel threads for NAPI
processing (threaded NAPI, described below).

All in all NAPI abstracts away from the drivers the context and
configuration of event (packet Rx and Tx) processing.

The two most important elements of NAPI are the struct napi_struct
and the associated poll method. struct napi_struct holds the state
of the NAPI instance while the method is the driver-specific event
handler. The method will typically free Tx packets that have been
transmitted and process newly received packets.

netif_napi_add() and netif_napi_del() add/remove a NAPI instance
from the system. The instances are attached to the netdevice passed
as argument (and will be deleted automatically when the netdevice is
unregistered). Instances are added in a disabled state.

napi_enable() and napi_disable() manage the disabled state.
A disabled NAPI can't be scheduled and its poll method is guaranteed
to not be invoked. napi_disable() waits for ownership of the NAPI
instance to be released.

The control APIs are not idempotent. Control API calls are safe against
concurrent use of datapath APIs, but an incorrect sequence of control
API calls may result in crashes, deadlocks, or race conditions. For
example, calling napi_disable() multiple times in a row will deadlock.

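A minimal sketch of that lifecycle, assuming a recent kernel where
netif_napi_add() takes three arguments (older kernels also took a weight)
and a hypothetical per-ring driver structure ``struct mydrv_ring``:

.. code-block:: c

  /* Registration, typically at probe/open time; the instance starts disabled */
  netif_napi_add(netdev, &ring->napi, mydrv_napi_poll);
  napi_enable(&ring->napi);

  /* Teardown, in the reverse order, once the datapath is quiesced */
  napi_disable(&ring->napi);
  netif_napi_del(&ring->napi);
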
napi_schedule() is the basic method of scheduling a NAPI poll.
Drivers should call it from their interrupt handler (see the
scheduling section below). A successful call to napi_schedule()
will take ownership of the NAPI instance.

Later, after NAPI is scheduled, the driver's poll method will be
called to process the events/packets. The method takes a ``budget``
argument - drivers can process completions for any number of Tx
packets but should only process up to ``budget`` number of
Rx packets. Rx processing is usually much more expensive.

In other words for Rx processing the ``budget`` argument limits how many
packets the driver can process in a single poll. skb Tx processing
should happen regardless of the ``budget``, but if the argument is 0
the driver cannot call any XDP (or page pool) APIs.

The ``budget`` argument may be 0 if the core tries to only process
skb Tx completions and no Rx or XDP packets.

The poll method returns the amount of work done. If the driver still
has outstanding work to do (e.g. ``budget`` was exhausted)
the poll method should return exactly ``budget``. In that case,
the NAPI instance will be serviced/polled again (without the
need for rescheduling).

If event processing has been completed (all outstanding packets
processed) the poll method should call napi_complete_done()
before returning. napi_complete_done() releases the ownership
of the instance.

The case of finishing all events and using exactly ``budget``
must be handled carefully. There is no way to report this
(rare) condition to the stack, so the driver must either
not call napi_complete_done() and wait to be called again,
or return ``budget - 1``.

If the ``budget`` is 0 napi_complete_done() should never be called.

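Taken together, a driver poll method usually has the shape sketched below.
This is only an illustration of the rules above, not a reference
implementation; ``struct mydrv_ring``, ``mydrv_clean_tx()``, ``mydrv_rx()``
and ``mydrv_unmask_irq()`` are hypothetical driver helpers:

.. code-block:: c

  static int mydrv_napi_poll(struct napi_struct *napi, int budget)
  {
      struct mydrv_ring *ring = container_of(napi, struct mydrv_ring, napi);
      int work_done = 0;

      /* Tx completions are processed regardless of the budget */
      mydrv_clean_tx(ring);

      /* Rx (and any XDP / page pool work) only when budget is non-zero */
      if (budget)
          work_done = mydrv_rx(ring, budget);

      /* Budget exhausted: return exactly budget so the instance is polled again */
      if (work_done >= budget)
          return budget;

      /* All events handled: release ownership before unmasking the IRQ */
      if (napi_complete_done(napi, work_done))
          mydrv_unmask_irq(ring);

      return work_done;
  }
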
Drivers should not make assumptions about the exact sequencing
of calls. The poll method may be called without the driver scheduling
the instance (unless the instance is disabled). Similarly,
it's not guaranteed that the poll method will be called, even
if napi_schedule() succeeded (e.g. if the instance gets disabled).

As mentioned in the :ref:`drv_ctrl` section - napi_disable() and subsequent
calls to the poll method only wait for the ownership of the instance
to be released, not for the poll method to exit. This means that
drivers should avoid accessing any data structures after calling
napi_complete_done().

Drivers should keep the interrupts masked after scheduling
the NAPI instance - until NAPI polling finishes any further
interrupts are unnecessary.

Drivers which have to mask the interrupts explicitly (as opposed
to the IRQ being auto-masked by the device) should use the
napi_schedule_prep() and __napi_schedule() calls.

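A minimal sketch of that pattern, assuming a hypothetical
``mydrv_mask_rxtx_irq()`` helper which masks the vector's interrupt:

.. code-block:: c

  if (napi_schedule_prep(&v->napi)) {
      /* mask the IRQ before scheduling to avoid racing with the poll */
      mydrv_mask_rxtx_irq(v->idx);
      __napi_schedule(&v->napi);
  }
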
Modern devices have multiple NAPI instances (struct napi_struct) per
interface. There is no strong requirement on how the instances are
mapped to queues and interrupts; most often a NAPI instance corresponds
1:1:1 to an interrupt and a queue pair (a single Rx and a single Tx
queue), although Rx and Tx queues may also be serviced by separate NAPI
instances on a single core. Regardless of the queue assignment, however,
there is usually still a 1:1 mapping between NAPI instances and interrupts.

It's worth noting that the ethtool API uses a "channel" terminology where
each channel can be either ``rx``, ``tx`` or ``combined``. It's not clear
what constitutes a channel; the recommended interpretation is to understand
a channel as an IRQ/NAPI instance which services queues of a given type.
For example, a configuration of 1 ``rx``, 1 ``tx`` and 1 ``combined``
channel is expected to utilize 3 interrupts, 2 Rx and 2 Tx queues.

Drivers often allocate and free NAPI instances dynamically, which would
normally lose any NAPI-related user configuration each time the instances
are reallocated. The netif_napi_add_config() API prevents this loss of
configuration by associating each NAPI instance with a persistent NAPI
configuration based on a driver defined index value, like a queue number.

Using this API allows for persistent NAPI IDs (among other settings), which
can be beneficial to userspace programs using ``SO_INCOMING_NAPI_ID``. See
the sections below for other NAPI configuration settings.

User interactions with NAPI depend on the NAPI instance ID. The instance
IDs are only visible to the user through the ``SO_INCOMING_NAPI_ID``
socket option.

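A minimal sketch of querying that ID on a connected socket (error handling
elided):

.. code-block:: c

  #include <stdio.h>
  #include <sys/socket.h>

  /* report which NAPI instance is handling this socket's traffic */
  static void print_napi_id(int fd)
  {
      unsigned int napi_id = 0;
      socklen_t len = sizeof(napi_id);

      if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID, &napi_id, &len) == 0)
          printf("socket %d -> NAPI ID %u\n", fd, napi_id);
  }
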
NAPI IDs for a device or its queues can also be queried over netlink via
netdev-genl, either programmatically or by using a script included in
the kernel source tree: ``tools/net/ynl/pyynl/cli.py``.

For example, the script can dump all of the queues for a device, which
will reveal each queue's NAPI ID.

NAPI does not perform any explicit event coalescing by default; in most
scenarios batching happens due to IRQ coalescing which is done
by the device. There are cases where software coalescing is helpful.

NAPI can be configured to arm a repoll timer instead of unmasking
the hardware interrupts as soon as all packets are processed.
The ``gro_flush_timeout`` sysfs configuration of the netdevice
is reused to control the delay of the timer, while
``napi_defer_hard_irqs`` controls the number of consecutive empty polls
before NAPI gives up and goes back to using hardware IRQs.

The above parameters can also be set on a per-NAPI basis using netlink via
netdev-genl. When used with netlink and configured on a per-NAPI basis, the
parameters use hyphens instead of underscores: ``gro-flush-timeout`` and
``napi-defer-hard-irqs``.

Per-NAPI configuration can be done programmatically in a user application
or by using a script included in the kernel source tree:
``tools/net/ynl/pyynl/cli.py``. For example, the script can be used to set
these parameters for a specific NAPI ID.

Similarly, the parameter ``irq-suspend-timeout`` can be set using netlink
via netdev-genl; there is no global sysfs parameter for this value.

Busy polling allows a user process to check for incoming packets before
the device interrupt fires. As is the case with any busy polling it trades
off CPU usage for lower latency.

Busy polling is enabled by either setting ``SO_BUSY_POLL`` on
selected sockets or using the global ``net.core.busy_poll`` and
``net.core.busy_read`` sysctls. An io_uring API for NAPI busy polling
also exists.

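A minimal per-socket sketch (the 50 microsecond value is purely
illustrative):

.. code-block:: c

  #include <sys/socket.h>

  /* request busy polling for roughly 50us on this socket */
  static int enable_busy_poll(int fd)
  {
      int usecs = 50;

      return setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs));
  }
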
It is possible to trigger packet processing directly from calls to
``epoll_wait``. In order to use this feature, a user application must ensure
all file descriptors which are added to an epoll context have the same NAPI ID.

If the application uses a dedicated acceptor thread, the application can obtain
the NAPI ID of the incoming connection using SO_INCOMING_NAPI_ID and then
distribute that file descriptor to a worker thread. The worker thread would add
the file descriptor to its epoll context. This would ensure each worker thread
has an epoll context with FDs that have the same NAPI ID.

Alternatively, if the application uses SO_REUSEPORT, a bpf or ebpf program can
be inserted to distribute incoming connections to threads such that each thread
is only given incoming connections with the same NAPI ID. Care must be taken to
handle cases where a system may have multiple NICs.

In order to enable busy polling, there are two choices:

1. The global ``/proc/sys/net/core/busy_poll`` sysctl can be set to a busy
   polling time in microseconds. This is a system-wide setting that makes
   every epoll-based application busy poll, which may
   not be desirable as many applications may not have the need to busy poll.

2. Applications using recent kernels can issue an ioctl on the epoll context
   file descriptor to set (``EPIOCSPARAMS``) or get (``EPIOCGPARAMS``) the
   context's ``struct epoll_params`` (a small structure padded to a multiple
   of 64 bits).

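For reference, a user program can define the structure roughly as below,
mirroring the kernel's uapi layout (field names taken from the
``EPIOC*PARAMS`` interface):

.. code-block:: c

  #include <stdint.h>

  struct epoll_params {
      uint32_t busy_poll_usecs;   /* busy poll time in microseconds */
      uint16_t busy_poll_budget;  /* maximum packets per poll */
      uint8_t prefer_busy_poll;   /* boolean: prefer busy poll over IRQs */

      /* pad the struct to a multiple of 64bits */
      uint8_t __pad;
  };
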
Very high request-per-second applications (especially routing/forwarding
applications and especially applications using AF_XDP sockets) may not want
to be interrupted once they finish processing a single request.
Such applications can pledge to the kernel that they will perform a busy
polling operation periodically, and the driver should keep the device IRQs
permanently masked. This mode is enabled by using the ``SO_PREFER_BUSY_POLL``
socket option. To avoid system misbehavior the pledge is revoked
if ``gro_flush_timeout`` passes without any busy poll call. For epoll-based
busy polling applications, the ``prefer_busy_poll`` field of ``struct
epoll_params`` can be set to 1 and the ``EPIOCSPARAMS`` ioctl can be issued to
enable this mode. See the above section for more details.

The NAPI budget for busy polling is lower than the default (which makes
sense given the low latency intention of normal busy polling). This is
not the case with IRQ mitigation, however, so the budget can be adjusted
with the ``SO_BUSY_POLL_BUDGET`` socket option. For epoll-based busy polling
applications, the ``busy_poll_budget`` field can be adjusted to the desired
value in ``struct epoll_params`` and set on a specific epoll context using
the ``EPIOCSPARAMS`` ioctl. See the above section for more details.

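Putting the two epoll settings together, an application might configure its
epoll context roughly as follows. The values are illustrative only;
``EPIOCSPARAMS`` and ``struct epoll_params`` come from the kernel uapi header
``linux/eventpoll.h`` on sufficiently new kernels:

.. code-block:: c

  #include <linux/eventpoll.h>
  #include <sys/ioctl.h>

  static int enable_prefer_busy_poll(int epfd)
  {
      struct epoll_params params = {
          .busy_poll_usecs  = 64,  /* illustrative busy poll time */
          .busy_poll_budget = 64,  /* raise the per-poll packet budget */
          .prefer_busy_poll = 1,   /* ask the kernel to keep device IRQs masked */
      };

      /* apply the parameters to this epoll context */
      return ioctl(epfd, EPIOCSPARAMS, &params);
  }
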
It is important to note that choosing a large value for ``gro_flush_timeout``
will defer IRQs to allow for better batch processing, but will induce latency
when the system is not fully loaded. Choosing a small value for
``gro_flush_timeout`` can cause interference of the user application which is
expected to busy poll with the device IRQs and softirq processing. The value
should be chosen carefully with these tradeoffs in mind.

IRQ suspension is a mechanism wherein device IRQs are masked while epoll
triggers NAPI packet processing.

While application calls to epoll_wait successfully retrieve events, the kernel
will defer the IRQ suspension timer. If the kernel does not retrieve any events
while busy polling (for example, because network traffic levels subsided), IRQ
suspension is disabled and the IRQ mitigation strategies described above are
engaged.

To use this mechanism:

1. The per-NAPI config parameter ``irq-suspend-timeout`` should be set to the
   maximum time (in nanoseconds) the application can have its IRQs
   suspended. This is done using netlink, as described above. The timeout
   serves as a safety mechanism to restart IRQ-driven interrupt processing if
   the application has stalled. This value should be chosen so that it covers
   the amount of time the user application needs to process data from its
   call to epoll_wait, noting that applications can control how much data
   they retrieve by setting ``max_events`` when calling epoll_wait.

2. The sysfs parameter or per-NAPI config parameters ``gro_flush_timeout``
   and ``napi_defer_hard_irqs`` must be set to low values. They will be used
   to defer IRQs after busy poll has found no data.

3. The ``prefer_busy_poll`` flag must be set to true. This can be done using
   the ``EPIOCSPARAMS`` ioctl as described above.

4. The application uses epoll as described above to trigger NAPI packet
   processing (see the sketch after this list).

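The userland half of that cycle is an ordinary epoll loop; a minimal sketch
is shown below, where ``handle_event()`` is a hypothetical application
handler and ``MAX_EVENTS`` bounds how much work is pulled in per processing
cycle:

.. code-block:: c

  #include <sys/epoll.h>

  #define MAX_EVENTS 64  /* bounds one userland processing cycle */

  void handle_event(struct epoll_event *ev);  /* hypothetical handler */

  static void run_event_loop(int epfd)
  {
      struct epoll_event events[MAX_EVENTS];

      for (;;) {
          /* While this keeps returning events, IRQ suspension stays active;
           * an empty return hands control back to the IRQ mitigation path.
           */
          int n = epoll_wait(epfd, events, MAX_EVENTS, -1);

          for (int i = 0; i < n; i++)
              handle_event(&events[i]);
      }
  }
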
As mentioned above, as long as subsequent calls to epoll_wait return events to
userland, the ``irq-suspend-timeout`` is deferred and IRQs are disabled. This
allows the application to process data without interference.

Once a call to epoll_wait results in no events being found, IRQ suspension is
automatically disabled and the ``gro_flush_timeout`` and
``napi_defer_hard_irqs`` mitigation mechanisms take over.

It is expected that ``irq-suspend-timeout`` will be set to a value much larger
than ``gro_flush_timeout``, as ``irq-suspend-timeout`` should suspend IRQs for
the duration of one userland processing cycle.

IRQ suspension causes the system to alternate between polling mode and
irq-driven packet delivery. During busy periods, ``irq-suspend-timeout``
overrides ``gro_flush_timeout`` and keeps the system busy polling, but when
epoll finds no events, the settings of ``gro_flush_timeout`` and
``napi_defer_hard_irqs`` determine the next step.

Therefore, setting ``gro_flush_timeout`` and ``napi_defer_hard_irqs`` is
the recommended usage, because otherwise setting ``irq-suspend-timeout``
might not have any discernible effect.

Threaded NAPI is an operating mode that uses dedicated kernel threads rather
than software IRQ context for NAPI processing.
The configuration is per netdevice and will affect all
NAPI instances of that device. Each NAPI instance will spawn a separate
thread (called ``napi/${ifc-name}-${napi-id}``).

It is recommended to pin each kernel thread to a single CPU, the same
CPU as the CPU which services the interrupt. Note that the mapping
between IRQs and NAPI instances may not be trivial (and is driver
dependent). The NAPI instance IDs will be assigned in the opposite
order than the process IDs of the kernel threads.

Threaded NAPI is controlled by writing 0/1 to the ``threaded`` file in
netdev's sysfs directory. It can also be enabled for a specific NAPI
instance using the netlink interface, for example via the ``cli.py``
script mentioned above.