
.. SPDX-License-Identifier: GPL-2.0

=====================================
Scaling in the Linux Networking Stack
=====================================


Introduction
============

This document describes a set of complementary techniques in the Linux
networking stack to increase parallelism and improve performance for
multi-processor systems.

The following technologies are described:
- RSS: Receive Side Scaling
- RPS: Receive Packet Steering
- RFS: Receive Flow Steering
- Accelerated Receive Flow Steering
- XPS: Transmit Packet Steering

RSS: Receive Side Scaling
=========================

Contemporary NICs support multiple receive and transmit descriptor queues
(multi-queue). On reception, a NIC can send different packets to different
queues to distribute processing among CPUs. The NIC distributes packets by
applying a filter to each packet that assigns it to one of a small number
of logical flows. Packets for each flow are steered to a separate receive
queue, which in turn can be processed by separate CPUs. This mechanism is
generally known as “Receive-side Scaling” (RSS). The goal of RSS and
the other scaling techniques is to increase performance uniformly.
Multi-queue distribution can also be used for traffic prioritization,
but that is not the focus of these techniques.

The filter used in RSS is typically a hash function over the network
and/or transport layer headers -- for example, a 4-tuple hash over
IP addresses and TCP ports of a packet. The most common hardware
implementation of RSS uses a 128-entry indirection table where each entry
stores a queue number. The receive queue for a packet is determined
by masking out the low order seven bits of the computed hash for the
packet (usually a Toeplitz hash), taking this number as a key into the
indirection table and reading the corresponding value.

Some NICs support symmetric RSS hashing, where the computed hash is the
same if the IP (source address, destination address) and TCP/UDP (source
port, destination port) tuples are swapped. This is beneficial for
applications that monitor TCP/IP flows (IDS, firewalls, etc.) and need
both directions of the flow to land on the same Rx queue (and CPU). The
"Symmetric-XOR" and "Symmetric-OR-XOR" RSS algorithms achieve this hash
symmetry by XORing/ORing the source and destination fields of the packet
before hashing. This, however, reduces input entropy and could
potentially be exploited.

Specifically, the "Symmetric-XOR" algorithm XORs the input as follows::

    (SRC_IP ^ DST_IP, SRC_IP ^ DST_IP, SRC_PORT ^ DST_PORT, SRC_PORT ^ DST_PORT)

The "Symmetric-OR-XOR" algorithm, on the other hand, transforms the input
as follows::

    (SRC_IP | DST_IP, SRC_IP ^ DST_IP, SRC_PORT | DST_PORT, SRC_PORT ^ DST_PORT)

The result is then fed to the underlying RSS algorithm.

Some advanced NICs allow steering packets to queues based on
programmable filters. For example, webserver bound TCP port 80 packets
can be directed to their own receive queue. Such “n-tuple” filters can
be configured from ethtool (--config-ntuple).
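
For illustration only (the interface name eth0, the queue number and the
port are placeholders, and n-tuple support depends on the driver), steering
webserver traffic to a dedicated queue might look like::

    # Enable n-tuple filters, then steer TCP port 80 traffic to rx queue 2
    ethtool -K eth0 ntuple on
    ethtool -N eth0 flow-type tcp4 dst-port 80 action 2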

RSS Configuration
-----------------

The driver for a multi-queue capable NIC typically provides a kernel
module parameter for specifying the number of hardware queues to
configure. In the bnx2x driver, for instance, this parameter is called
num_queues. A typical RSS configuration would be to have one receive queue
for each CPU if the device supports enough queues, or otherwise at least
one for each memory domain, where a memory domain is a set of CPUs that
share a particular memory level (L1, L2, NUMA node, etc.).

The indirection table of an RSS device, which resolves a queue by masked
hash, is usually programmed by the driver at initialization. The
default mapping is to distribute the queues evenly in the table, but the
indirection table can be retrieved and modified at runtime using ethtool
commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the
indirection table could be done to give different queues different
relative weights.
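
As a sketch (device name and queue counts are placeholders, and option
support depends on the driver), the table can be inspected and reweighted
like this::

    # Show the current indirection table and hash key
    ethtool -x eth0
    # Spread the table evenly over the first 4 receive queues
    ethtool -X eth0 equal 4
    # Give queue 0 three times the weight of queue 1
    ethtool -X eth0 weight 3 1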

RSS IRQ Configuration
~~~~~~~~~~~~~~~~~~~~~

Each receive queue has a separate IRQ associated with it. The NIC triggers
this to notify a CPU when new packets arrive on the given queue. The
signaling path for PCIe devices uses message signaled interrupts (MSI-X),
which can route each interrupt to a particular CPU. The active mapping
of queues to IRQs can be determined from /proc/interrupts. By default,
an IRQ may be handled on any CPU. Because a non-negligible part of packet
processing takes place in receive interrupt handling, it is advantageous
to spread receive interrupts between CPUs. To manually adjust the IRQ
affinity of each interrupt, see
Documentation/core-api/irq/irq-affinity.rst. Some systems will be running
irqbalance, a daemon that dynamically optimizes IRQ assignments and as a
result may override any manual settings.
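
As a minimal example (the IRQ number 45 and the masks below are
hypothetical; the real numbers come from /proc/interrupts), a receive
interrupt can be pinned to CPU 2 with either interface::

    # Bitmap interface: bit 2 set selects CPU 2
    echo 4 > /proc/irq/45/smp_affinity
    # Equivalent CPU-list interface
    echo 2 > /proc/irq/45/smp_affinity_list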

Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

RSS should be enabled when latency is a concern or whenever receive
interrupt processing forms a bottleneck. Spreading load between CPUs
decreases queue length. For low latency networking, the optimal setting
is to allocate as many queues as there are CPUs in the system (or the
NIC maximum, if lower). The most efficient high-rate configuration
is likely the one with the smallest number of receive queues where no
receive queue overflows due to a saturated CPU, because in default
mode with interrupt coalescing enabled, the aggregate number of
interrupts (and thus work) grows with each additional queue.

Per-CPU load can be observed using the mpstat utility, but note that on
processors with hyperthreading (HT), each hyperthread is represented as
a separate CPU. For interrupt handling, HT has shown no benefit in
initial tests, so limit the number of queues to the number of CPU cores
in the system.
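
For example, softirq and interrupt time per CPU can be watched while tuning
the queue count (the interface name is a placeholder)::

    # Per-CPU utilization, including %irq and %soft, refreshed every second
    mpstat -P ALL 1
    # Interrupt counts per queue and per CPU
    watch -n 1 'grep eth0 /proc/interrupts'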

Dedicated RSS contexts
~~~~~~~~~~~~~~~~~~~~~~

Modern NICs support creating multiple co-existing RSS configurations
which are selected based on explicit matching rules. This can be very
useful when an application wants to constrain the set of queues receiving
traffic for e.g. a particular destination port or IP address.
The example below shows how to direct all traffic to TCP port 22
to queues 0 and 1.

To create an additional RSS context use::

  # ethtool -X eth0 hfunc toeplitz context new

The kernel reports back the ID of the newly created context (1 in this
example). Note that the context has no queues assigned to it by default,
so its RSS table and hash key may need to be configured manually. The
current contents of the context can be inspected with::

  # ethtool -x eth0 context 1

For example, to spread the context's traffic over queues 0 and 1::

  # ethtool -X eth0 equal 2 context 1
  # ethtool -x eth0 context 1

To make use of the new context, direct traffic to it using an n-tuple
filter; ethtool reports back the ID of the newly inserted rule (1023 in
this example)::

  # ethtool -N eth0 flow-type tcp6 dst-port 22 context 1

When done, remove the rule and the context::

  # ethtool -N eth0 delete 1023
  # ethtool -X eth0 context 1 delete

RPS: Receive Packet Steering
============================

Receive Packet Steering (RPS) is logically a software implementation of
RSS. Being in software, it is necessarily called later in the datapath.
Whereas RSS selects the queue and hence the CPU that will run the hardware
interrupt handler, RPS selects the CPU to perform protocol processing
above the interrupt handler. This is accomplished by placing the packet
on the desired CPU's backlog queue and waking up that CPU for processing.
RPS has some advantages over RSS:

1) it can be used with any NIC
2) software filters can easily be added to hash over new protocols
3) it does not increase hardware device interrupt rate (although it does
   introduce inter-processor interrupts (IPIs))

RPS is called during the bottom half of the receive interrupt handler,
when a driver sends a packet up the network stack with netif_rx() or
netif_receive_skb(). These call the get_rps_cpu() function, which selects
the queue that should process the packet.

The first step in determining the target CPU for RPS is to calculate a
flow hash over the packet’s addresses or ports (a 2-tuple or 4-tuple hash
depending on the protocol). This serves as a consistent hash of the
packet's flow. The hash is either provided by hardware or computed in the
stack; capable hardware can pass the hash in the receive descriptor for
the packet, usually the same hash used for RSS (e.g. the computed Toeplitz
hash). The hash is saved in skb->hash and can be used elsewhere in the
stack as a hash of the packet's flow.

Each receive hardware queue has an associated list of CPUs to which
RPS may enqueue packets for processing. For each received packet, an
index into the list is computed from the flow hash modulo the size of
the list. The indexed CPU is the target for processing the packet,
and the packet is queued to the tail of that CPU’s backlog queue. At
the end of the bottom half routine, IPIs are sent to any CPUs for which
packets have been queued to their backlog queue. The IPI wakes backlog
processing on the remote CPU, and any queued packets are then processed
up the networking stack.

RPS Configuration
-----------------

RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (on
by default for SMP). Even when compiled in, RPS remains disabled until
explicitly configured. The list of CPUs to which RPS may forward traffic
can be configured for each receive queue using a sysfs file entry::

  /sys/class/net/<dev>/queues/rx-<n>/rps_cpus

This file implements a bitmap of CPUs. RPS is disabled when it is zero
(the default), in which case packets are processed on the interrupting
CPU. Documentation/core-api/irq/irq-affinity.rst explains how CPUs are
assigned to the bitmap.
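
A minimal sketch, assuming device eth0 and a machine where CPUs 0-3 share
the interrupting CPU's memory domain::

    # Allow RPS to steer packets from rx queue 0 to CPUs 0-3 (bitmap 0xf)
    echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus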

Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

For a single queue device, a typical RPS configuration would be to set
rps_cpus to the CPUs in the same memory domain as the interrupting
CPU. If NUMA locality is not an issue, this could also be all CPUs in
the system. At high interrupt rate, it might be wise to exclude the
interrupting CPU from the map since it already performs much work.

For a multi-queue system, if RSS is configured so that a hardware
receive queue is mapped to each CPU, then RPS is probably redundant
and unnecessary. If there are fewer hardware queues than CPUs, then
RPS might be beneficial if the rps_cpus for each queue are the ones
that share the same memory domain as the interrupting CPU for that
queue.

RPS Flow Limit
--------------

RPS scales kernel receive processing across CPUs without introducing
reordering. The trade-off to sending all packets from the same flow
to the same CPU is CPU load imbalance if flows vary in packet rate.
In the extreme case a single flow dominates traffic. Especially on
common server workloads with many concurrent connections, such
behavior indicates a problem such as a misconfiguration or spoofed
source Denial of Service attack.

Flow Limit is an optional RPS feature that prioritizes small flows
during CPU contention by dropping packets from large flows slightly
ahead of those from small flows. It is active only when an RPS or RFS
destination CPU approaches saturation. Once a CPU's input packet
queue exceeds half the maximum queue length (as set by sysctl
net.core.netdev_max_backlog), the kernel starts a per-flow packet
count over the last 256 packets. If a flow exceeds a set ratio (by
default, half) of these packets when a new packet arrives, then the
new packet is dropped. Packets from other flows are still only
dropped once the input packet queue reaches netdev_max_backlog.
No packets are dropped when the input packet queue length is below
the threshold, so flow limit does not sever connections outright:
even large flows maintain connectivity.

Interface
~~~~~~~~~

Flow limit is compiled in by default (CONFIG_NET_FLOW_LIMIT), but not
turned on. It is implemented for each CPU independently (to avoid lock
and cache contention) and toggled per CPU by setting the relevant bit
in sysctl net.core.flow_limit_cpu_bitmap. It exposes the same CPU
bitmap interface as rps_cpus (see above) when called from procfs::

  /proc/sys/net/core/flow_limit_cpu_bitmap

Per-flow rate is calculated by hashing each packet into a hashtable
bucket and incrementing a per-bucket counter. The hash function is
the same that selects a CPU in RPS, but as the number of buckets can
be much larger than the number of CPUs, flow limit has finer-grained
identification of large flows and fewer false positives. The default
table has 4096 buckets. This value can be modified through sysctl
net.core.flow_limit_table_len. It is only consulted when a new table
is allocated; modifying it does not update active tables.

Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

Flow limit is useful on systems with many concurrent connections,
where a single connection taking up 50% of a CPU indicates a problem.
In such environments, enable the feature on all CPUs that handle
network rx interrupts (as set in /proc/irq/N/smp_affinity).

The feature depends on the input packet queue length exceeding
the flow limit threshold (50%) plus the flow history length (256).
Setting net.core.netdev_max_backlog to either 1000 or 10000
performed well in experiments.
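
A possible configuration along these lines (the CPU bitmap ff, i.e. CPUs
0-7, is an assumption about which CPUs handle rx interrupts)::

    # Enable flow limit on CPUs 0-7 and raise the input packet queue length
    echo ff > /proc/sys/net/core/flow_limit_cpu_bitmap
    sysctl -w net.core.netdev_max_backlog=10000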

RFS: Receive Flow Steering
==========================

While RPS steers packets solely based on hash, and thus generally
provides good load distribution, it does not take into account
application locality. This is accomplished by Receive Flow Steering
(RFS). The goal of RFS is to increase datacache hit rate by steering
kernel processing of packets to the CPU where the application thread
consuming the packet is running. RFS relies on the same RPS mechanisms
to enqueue packets onto the backlog of another CPU and to wake up that
CPU.

In RFS, packets are not forwarded directly by the value of their hash,
but the hash is used as an index into a flow lookup table. This table maps
flows to the CPUs where those flows are being processed. The flow hash
(see RPS section above) is used to calculate the index into this table.
The CPU recorded in each entry is the one which last processed the flow.
If an entry does not hold a valid CPU, then packets mapped to that entry
are steered using plain RPS. Multiple table entries may point to the
same CPU. Indeed, with many flows and few CPUs, it is very likely that
a single application thread handles flows with many different flow hashes.

rps_sock_flow_table is a global flow table that contains the *desired* CPU
for flows: the CPU that is currently processing the flow in userspace.
Each table value is a CPU index that is updated during calls to recvmsg
and sendmsg (specifically, inet_recvmsg(), inet_sendmsg(), inet_accept()
and tcp_splice_read()).

When the scheduler moves a thread to a new CPU while it has outstanding
receive packets on the old CPU, packets may arrive out of order. To
avoid this, RFS uses a second flow table to track outstanding packets
for each flow: rps_dev_flow_table is a table specific to each hardware
receive queue of each device. Each table value stores a CPU index and a
counter. The CPU index represents the *current* CPU onto which packets
for this flow are enqueued for further kernel processing. Ideally, kernel
and userspace processing occur on the same CPU, and hence the CPU index
in both tables is identical. This is likely false if the scheduler has
recently migrated a userspace thread while the kernel still has packets
enqueued for kernel processing on the old CPU.

The counter in rps_dev_flow_table values records the length of the current
CPU's backlog when a packet in this flow was last enqueued. Each backlog
queue has a head counter that is incremented on dequeue, and a tail counter
computed as the head counter plus the queue length. In other words, the
counter in rps_dev_flow[i] records the last element in flow i that has
been enqueued onto the currently designated CPU for flow i (of course,
entry i is actually selected by hash and multiple flows may hash to the
same entry i).

And now the trick for avoiding out-of-order packets: when selecting the
CPU for packet processing (from get_rps_cpu()), the rps_sock_flow table
and the rps_dev_flow table of the queue that the packet was received on
are compared. If the desired CPU for the flow (found in the
rps_sock_flow table) matches the current CPU (found in the rps_dev_flow
table), the packet is enqueued onto that CPU's backlog. If they differ,
the current CPU is updated to match the desired CPU if one of the
following is true:

  - The current CPU's queue head counter >= the recorded tail counter
    value in rps_dev_flow[i]
  - The current CPU is unset (>= nr_cpu_ids)
  - The current CPU is offline

After this check, the packet is sent to the (possibly updated) current
CPU. These rules aim to ensure that a flow only moves to a new CPU when
there are no packets outstanding on the old CPU, as the outstanding
packets could arrive later than those about to be processed on the new
CPU.

RFS Configuration
-----------------

RFS is only available if the kconfig symbol CONFIG_RPS is enabled (on
by default for SMP). The functionality remains disabled until explicitly
configured. The number of entries in the global flow table is set
through::

  /proc/sys/net/core/rps_sock_flow_entries

The number of entries in the per-queue flow table is set through::

  /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt

Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

Both of these need to be set before RFS is enabled for a receive queue.
Values for both are rounded up to the nearest power of two. The
suggested flow count depends on the expected number of active connections
at any given time, which may be significantly less than the number of open
connections. We have found that a value of 32768 for rps_sock_flow_entries
works fairly well on a moderately loaded server.

For a single queue device, the rps_flow_cnt value for the single queue
would normally be configured to the same value as rps_sock_flow_entries.
For a multi-queue device, the rps_flow_cnt for each queue might be
configured as rps_sock_flow_entries / N, where N is the number of
queues. So for instance, if rps_sock_flow_entries is set to 32768 and there
are 16 configured receive queues, rps_flow_cnt for each queue might be
configured as 2048.
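
The 32768/16-queue example above could be scripted roughly as follows
(the device name and queue count are assumptions)::

    # Global table sized for ~32k active flows
    sysctl -w net.core.rps_sock_flow_entries=32768
    # 32768 / 16 queues = 2048 entries per queue
    for n in $(seq 0 15); do
        echo 2048 > /sys/class/net/eth0/queues/rx-$n/rps_flow_cnt
    done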

Accelerated RFS
===============

Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated load
balancing mechanism that uses soft state to steer flows based on where
the application thread consuming the packets of each flow is running.
Accelerated RFS should perform better than RFS since packets are sent
directly to a CPU local to the thread consuming the data. The target CPU
will either be the same CPU where the application runs, or at least a CPU
which is local to the application thread's CPU in the cache hierarchy.

To enable accelerated RFS, the networking stack calls the
ndo_rx_flow_steer driver function to communicate the desired hardware
queue for packets matching a particular flow. The network stack
automatically calls this function every time a flow entry in
rps_dev_flow_table is updated. The driver in turn uses a device specific
method to program the NIC to steer the packets.

The hardware queue for a flow is derived from the CPU recorded in
rps_dev_flow_table. The stack consults a CPU to hardware queue map which
is maintained by the NIC driver. This is an auto-generated reverse map of
the IRQ affinity table shown by /proc/interrupts. Drivers can use
functions in the cpu_rmap ("CPU affinity reverse map") kernel library
to populate the map. Alternatively, drivers can delegate the cpu_rmap
management to the kernel by calling netif_enable_cpu_rmap(). For each CPU,
the corresponding queue in the map is set to be one whose processing CPU is
closest in cache locality.

Accelerated RFS Configuration
-----------------------------

Accelerated RFS is only available if the kernel is compiled with
CONFIG_RFS_ACCEL and support is provided by the NIC device and driver.
It also requires that ntuple filtering is enabled via ethtool. The map
of CPU to queues is automatically deduced from the IRQ affinities
configured for each receive queue by the driver, so no additional
configuration should be necessary.

Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

This technique should be enabled whenever one wants to use RFS and the
NIC supports hardware acceleration.
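
In practice that usually amounts to checking the kernel config and turning
on n-tuple filtering (the interface name and the config file path below are
placeholders for a typical distribution layout)::

    # Accelerated RFS requires CONFIG_RFS_ACCEL ...
    grep CONFIG_RFS_ACCEL /boot/config-$(uname -r)
    # ... and n-tuple filtering enabled on the device
    ethtool -K eth0 ntuple on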

XPS: Transmit Packet Steering
=============================

Transmit Packet Steering is a mechanism for intelligently selecting
which transmit queue to use when transmitting a packet on a multi-queue
device. This can be accomplished by recording two kinds of maps, either
a mapping of CPU to hardware queue(s) or a mapping of receive queue(s)
to hardware transmit queue(s).

1. XPS using CPUs map

The goal of this mapping is usually to assign queues
exclusively to a subset of CPUs, where the transmit completions for
these queues are processed on a CPU within this set. This choice
provides two benefits. First, contention on the device queue lock is
significantly reduced since fewer CPUs contend for the same queue
(contention can be eliminated completely if each CPU has its own
transmit queue). Secondly, cache miss rate on transmit completion is
reduced, in particular for data cache lines that hold the sk_buff
structures.

2. XPS using receive queues map

This mapping is used to pick the transmit queue based on the receive
queue(s) map configuration set by the administrator. A set of receive
queues can be mapped to a set of transmit queues (many:many), although
the common use case is a 1:1 mapping. This enables sending packets
on the same queue associations for transmit and receive. This is useful for
busy polling multi-threaded workloads where there are challenges in
associating a given CPU to a given application thread. The application
threads are not pinned to CPUs and each thread handles packets
received on a single queue. The receive queue number is cached in the
socket for the connection. In this model, sending the packets on the
transmit queue corresponding to the associated receive queue has benefits
in keeping the CPU overhead low. Transmit completion work is locked into
the same queue-association that a given application is polling on. This
avoids the overhead of triggering an interrupt on another CPU. When the
application cleans up the packets during busy polling, transmit completion
may be processed in the same thread context, resulting in reduced latency.

XPS is configured per transmit queue by setting a bitmap of
CPUs/receive-queues that may use that queue to transmit. The reverse
mapping, from CPUs to transmit queues or from receive-queues to transmit
queues, is computed and maintained for each network device. When
transmitting the first packet in a flow, the function get_xps_queue() is
called to select a queue. This function uses the ID of the receive queue
for the socket connection for a match in the receive queue-to-transmit queue
lookup table. Alternatively, this function can also use the ID of the
running CPU as a key into the CPU-to-queue lookup table. If the
ID matches a single queue, that is used for transmission. If multiple
queues match, one is selected by using the flow hash to compute an index
into the set. When selecting the transmit queue based on the receive
queue(s) map, the transmit device is not validated against the receive
device, as that would require an expensive lookup operation in the
datapath.

The queue chosen for transmitting a particular flow is saved in the
corresponding socket structure for the flow (e.g. a TCP connection).
This transmit queue is used for subsequent packets sent on the flow to
prevent out of order (ooo) packets. The choice also amortizes the cost
of calling get_xps_queue() over all packets in the flow. To avoid
ooo packets, the transmit queue for a flow can subsequently only be
changed if skb->ooo_okay is set for a packet in the flow. This flag
indicates that there are no outstanding packets in the flow, so the
transmit queue can change without the risk of generating out of order
packets. The transport layer is responsible for setting ooo_okay
appropriately. TCP, for instance, sets the flag when all data for a
connection has been acknowledged.

XPS Configuration
-----------------

XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
default for SMP). If compiled in, it is driver dependent whether, and
how, XPS is configured at device init. The mapping of CPUs/receive-queues
to transmit queue can be inspected and configured using sysfs:

For selection based on CPUs map::

  /sys/class/net/<dev>/queues/tx-<n>/xps_cpus

For selection based on receive-queues map::

  /sys/class/net/<dev>/queues/tx-<n>/xps_rxqs
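
For example (device, queue numbers and masks are placeholders), a CPUs map
and a receive-queues map could be set like this::

    # Only CPUs 0 and 1 (bitmap 0x3) may transmit on tx queue 0
    echo 3 > /sys/class/net/eth0/queues/tx-0/xps_cpus
    # Pair tx queue 0 with rx queue 0 (bit 0 of the receive-queue bitmap)
    echo 1 > /sys/class/net/eth0/queues/tx-0/xps_rxqs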

Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

For a network device with a single transmission queue, XPS configuration
has no effect, since there is no choice in this case. In a multi-queue
system, XPS is preferably configured so that each CPU maps onto one queue.
If there are as many queues as there are CPUs in the system, then each
queue can also map onto one CPU, resulting in exclusive pairings that
experience no contention. If there are fewer queues than CPUs, then the
best CPUs to share a given queue are probably those that share the cache
with the CPU that processes transmit completions for that queue
(transmit interrupts).

For transmit queue selection based on receive queue(s), XPS has to be
explicitly configured, mapping receive-queue(s) to transmit queue(s). If
the user configuration for the receive-queue map does not apply, then the
transmit queue is selected based on the CPUs map.

Per TX Queue rate limitation
============================

This is a rate-limitation mechanism implemented by hardware. Currently
only a max-rate attribute is supported, set as a Mbps value via::

  /sys/class/net/<dev>/queues/tx-<n>/tx_maxrate

A value of zero means disabled, and this is the default.
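
For example, to cap a queue at 1 Gbit/s (device and queue are placeholders)::

    # Limit tx queue 0 to 1000 Mbps; writing 0 removes the limit
    echo 1000 > /sys/class/net/eth0/queues/tx-0/tx_maxrate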

Authors
=======

- Tom Herbert (therbert@google.com)
- Willem de Bruijn (willemb@google.com)