.. SPDX-License-Identifier: GPL-2.0

=====================
io_uring zero copy Rx
=====================

Introduction
============

io_uring zero copy Rx (ZC Rx) is a feature that removes kernel-to-user copy on
the network receive path, allowing packet data to be received directly into
userspace memory. This feature differs from TCP_ZEROCOPY_RECEIVE in that there
are no strict alignment requirements and no need to mmap()/munmap(). Compared
to kernel bypass solutions such as DPDK, the packet headers are still processed
by the kernel TCP stack as normal.

NIC HW Requirements
===================

Several NIC HW features are required for io_uring ZC Rx to work. For now, the
kernel API does not configure the NIC; this must be done by the user.

Header/data split
-----------------

Required to split packets at the L4 boundary into a header and a payload.
Headers are received into kernel memory and processed by the TCP stack as
normal, while payloads are received directly into userspace memory.

Flow steering
-------------

Specific HW Rx queues are configured for this feature, but modern NICs
typically distribute flows across all HW Rx queues. Flow steering is required
to ensure that only desired flows are directed towards HW queues that are
configured for io_uring ZC Rx.

RSS
---

In addition to flow steering above, RSS is required to steer all other non-zero
copy flows away from queues that are configured for io_uring ZC Rx.

Usage
=====

Setup NIC
---------

Must be done out of band for now.

Ensure there are at least two queues::

  ethtool -L eth0 combined 2

Enable header/data split::

  ethtool -G eth0 tcp-data-split on

Carve out half of the HW Rx queues for zero copy using RSS::

  ethtool -X eth0 equal 1

Set up flow steering, bearing in mind that queues are 0-indexed::

  ethtool -N eth0 flow-type tcp6 ... action 1
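
As an example, to steer a single TCP/IPv6 flow to queue 1 by matching on its
destination port (the port number below is only an illustration)::

  ethtool -N eth0 flow-type tcp6 dst-port 5201 action 1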

Setup io_uring
--------------

This section describes the low level io_uring kernel API. Please refer to
liburing documentation for how to use the higher level API.

Create an io_uring instance with the following required setup flags::

  IORING_SETUP_SINGLE_ISSUER
  IORING_SETUP_DEFER_TASKRUN
  IORING_SETUP_CQE32
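
With liburing, a minimal sketch of creating such an instance (the queue depth
of 64 is only an illustration)::

  struct io_uring ring;
  int ret;

  ret = io_uring_queue_init(64, &ring,
                            IORING_SETUP_SINGLE_ISSUER |
                            IORING_SETUP_DEFER_TASKRUN |
                            IORING_SETUP_CQE32);
  if (ret < 0)
    exit(1); /* handle error */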

Create memory area
------------------

Allocate a userspace memory area for receiving zero copy data. Its address and
size must be page aligned::

  /* sizing is application specific; the figure below is only an illustration */
  size_t area_size = 4096 * PAGE_SIZE;
  void *area_ptr = mmap(NULL, area_size,
                        PROT_READ | PROT_WRITE,
                        MAP_ANONYMOUS | MAP_PRIVATE,
                        -1, 0);

Create refill ring
------------------

Allocate memory for a shared ringbuf used for returning consumed buffers. It
consists of some space for the header, followed by an array of ``struct
io_uring_zcrx_rqe``, so compute the size first and round it up to the page
size::

  size_t rq_entries = 4096;
  size_t ring_size = rq_entries * sizeof(struct io_uring_zcrx_rqe) + PAGE_SIZE;
  /* align to page size */
  ring_size = (ring_size + (PAGE_SIZE - 1)) & ~(PAGE_SIZE - 1);

  void *ring_ptr = mmap(NULL, ring_size,
                        PROT_READ | PROT_WRITE,
                        MAP_ANONYMOUS | MAP_PRIVATE,
                        -1, 0);

Register ZC Rx
--------------

Fill in registration structs::

  struct io_uring_zcrx_area_reg area_reg = {
    .addr = (__u64)(unsigned long)area_ptr,
    .len = area_size,
    .flags = 0,
  };

  struct io_uring_region_desc region_reg = {
    .user_addr = (__u64)(unsigned long)ring_ptr,
    .size = ring_size,
    .flags = IORING_MEM_REGION_TYPE_USER,
  };

  struct io_uring_zcrx_ifq_reg reg = {
    .if_idx = if_nametoindex("eth0"),
    /* this is the HW queue with the desired flow steered into it */
    .if_rxq = 1,
    .rq_entries = rq_entries,
    .area_ptr = (__u64)(unsigned long)&area_reg,
    .region_ptr = (__u64)(unsigned long)&region_reg,
  };

Register with kernel::

  io_uring_register_ifq(ring, &reg);
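
On success, the kernel fills in the refill ring offsets in ``reg.offsets`` and
the area token in ``area_reg.rq_area_token``, both of which are used below.
``io_uring_register_ifq()`` is the liburing helper; when driving the low level
API directly, the equivalent is the ``io_uring_register(2)`` system call with
the ``IORING_REGISTER_ZCRX_IFQ`` opcode. A minimal sketch, with ``ring_fd``
standing in for the io_uring file descriptor::

  syscall(__NR_io_uring_register, ring_fd,
          IORING_REGISTER_ZCRX_IFQ, &reg, 1);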

Map refill ring
---------------

The kernel fills in fields for the refill ring in the registration ``struct
io_uring_zcrx_ifq_reg``. Map it into userspace::

  struct io_uring_zcrx_rq refill_ring;

  refill_ring.khead = (unsigned *)((char *)ring_ptr + reg.offsets.head);
  refill_ring.ktail = (unsigned *)((char *)ring_ptr + reg.offsets.tail);
  refill_ring.rqes =
    (struct io_uring_zcrx_rqe *)((char *)ring_ptr + reg.offsets.rqes);
  refill_ring.rq_tail = 0;
  /* needed for masking the tail when recycling buffers below */
  refill_ring.ring_entries = rq_entries;
  refill_ring.ring_ptr = ring_ptr;

Receiving data
--------------

Prepare a zero copy recv request::

  struct io_uring_sqe *sqe;

  sqe = io_uring_get_sqe(ring);
  io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, fd, NULL, 0, 0);
  sqe->ioprio |= IORING_RECV_MULTISHOT;

Now, submit and wait::

  io_uring_submit_and_wait(ring, 1);

Finally, process completions::

  struct io_uring_cqe *cqe;
  unsigned int count = 0;
  unsigned int head;

  io_uring_for_each_cqe(ring, head, cqe) {
    struct io_uring_zcrx_cqe *rcqe = (struct io_uring_zcrx_cqe *)(cqe + 1);

    unsigned long mask = (1ULL << IORING_ZCRX_AREA_SHIFT) - 1;
    unsigned char *data = area_ptr + (rcqe->off & mask);
    /* do something with the data */

    count++;
  }
  io_uring_cq_advance(ring, count);
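
Before touching the data, a real application should also check each
completion's status. A minimal sketch of such checks, to be performed inside
the loop above (this reflects common multishot conventions rather than
anything specific to ZC Rx)::

  if (cqe->res <= 0) {
    /* error or end of stream: no buffer to consume */
  }
  if (!(cqe->flags & IORING_CQE_F_MORE)) {
    /* multishot terminated: re-arm IORING_OP_RECV_ZC if desired */
  }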

Recycling buffers
-----------------

Return buffers back to the kernel to be used again::

  struct io_uring_zcrx_rqe *rqe;
  unsigned mask = refill_ring.ring_entries - 1;
  rqe = &refill_ring.rqes[refill_ring.rq_tail & mask];

  unsigned long area_offset = rcqe->off & ~IORING_ZCRX_AREA_MASK;
  rqe->off = area_offset | area_reg.rq_area_token;
  rqe->len = cqe->res;
  IO_URING_WRITE_ONCE(*refill_ring.ktail, ++refill_ring.rq_tail);
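
The refill ring holds at most ``ring_entries`` entries, so a careful
application should check that it does not outrun the kernel's head before
publishing a new tail. A minimal sketch of such a check (an application side
convention assumed here, not something mandated by the API)::

  if (refill_ring.rq_tail - IO_URING_READ_ONCE(*refill_ring.khead) ==
      refill_ring.ring_entries) {
    /* ring full: wait for the kernel to consume entries */
  }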

Testing
=======

See ``tools/testing/selftests/drivers/net/hw/iou-zcrx.c``
