.. SPDX-License-Identifier: GPL-2.0

=====================
io_uring zero copy Rx
=====================

Introduction
============

io_uring zero copy Rx (ZC Rx) is a feature that removes the kernel-to-user
copy on the network receive path, allowing packet data to be received directly
into userspace memory. This feature differs from TCP_ZEROCOPY_RECEIVE in that
there are no strict alignment requirements and no need to mmap()/munmap().
Compared to kernel bypass solutions such as DPDK, the packet headers are
processed by the kernel TCP stack as normal.

NIC HW Requirements
===================

Several NIC HW features are required for io_uring ZC Rx to work. For now, the
kernel API does not configure the NIC; this must be done by the user.

Header/data split
-----------------

Required to split packets at the L4 boundary into a header and a payload.
Headers are received into kernel memory as normal and processed by the TCP
stack as normal. Payloads are received directly into userspace memory.

Flow steering
-------------

Specific HW Rx queues are configured for this feature, but modern NICs
typically distribute flows across all HW Rx queues. Flow steering is required
to ensure that only desired flows are directed towards HW queues that are
configured for io_uring ZC Rx.

RSS
---

In addition to flow steering above, RSS is required to steer all other non-zero
copy flows away from queues that are configured for io_uring ZC Rx.

Usage
=====

Setup NIC
---------

Must be done out of band for now.

Ensure there are at least two queues::

  ethtool -L eth0 combined 2

Enable header/data split::

  ethtool -G eth0 tcp-data-split on

Carve out half of the HW Rx queues for zero copy using RSS::

  ethtool -X eth0 equal 1

Set up flow steering, bearing in mind that queues are 0-indexed::

  ethtool -N eth0 flow-type tcp6 ... action 1
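For example, if the flow of interest is IPv6 TCP traffic destined to local
port 8000 (a hypothetical port chosen purely for illustration), the rule could
look like::

  ethtool -N eth0 flow-type tcp6 dst-port 8000 action 1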
Setup io_uring
--------------

This section describes the low-level io_uring kernel API. Please refer to the
liburing documentation for how to use the higher-level API.

Create an io_uring instance with the following required setup flags::

  IORING_SETUP_SINGLE_ISSUER
  IORING_SETUP_DEFER_TASKRUN
  IORING_SETUP_CQE32

Create memory area
------------------

Allocate a userspace memory area for receiving zero copy data::

  void *area_ptr = mmap(NULL, area_size,
                        PROT_READ | PROT_WRITE,
                        MAP_ANONYMOUS | MAP_PRIVATE,
                        -1, 0);

Create refill ring
------------------

Allocate memory for a shared ringbuf used for returning consumed buffers::

  void *ring_ptr = mmap(NULL, ring_size,
                        PROT_READ | PROT_WRITE,
                        MAP_ANONYMOUS | MAP_PRIVATE,
                        -1, 0);

This refill ring consists of some space for the header, followed by an array of
``struct io_uring_zcrx_rqe``::

  size_t rq_entries = 4096;
  size_t ring_size = rq_entries * sizeof(struct io_uring_zcrx_rqe) + PAGE_SIZE;
  /* align to page size */
  ring_size = (ring_size + (PAGE_SIZE - 1)) & ~(PAGE_SIZE - 1);

Register ZC Rx
--------------

Fill in registration structs::

  struct io_uring_zcrx_area_reg area_reg = {
          .addr = (__u64)(unsigned long)area_ptr,
          .len = area_size,
          .flags = 0,
  };

  struct io_uring_region_desc region_reg = {
          .user_addr = (__u64)(unsigned long)ring_ptr,
          .size = ring_size,
          .flags = IORING_MEM_REGION_TYPE_USER,
  };

  struct io_uring_zcrx_ifq_reg reg = {
          .if_idx = if_nametoindex("eth0"),
          /* this is the HW queue with the desired flow steered into it */
          .if_rxq = 1,
          .rq_entries = rq_entries,
          .area_ptr = (__u64)(unsigned long)&area_reg,
          .region_ptr = (__u64)(unsigned long)&region_reg,
  };

Register with the kernel::

  io_uring_register_ifq(ring, &reg);
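``io_uring_register_ifq()`` is a liburing helper. For reference, a minimal
sketch of the equivalent raw registration, reusing ``reg`` from above and
assuming ``ring_fd`` is the io_uring file descriptor
(``io_uring_register(2)`` has no libc wrapper, hence ``syscall(2)``)::

  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/io_uring.h>

  int ret = syscall(__NR_io_uring_register, ring_fd,
                    IORING_REGISTER_ZCRX_IFQ, &reg, 1);
  /* 0 on success; -1 with errno set on failure */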
Map refill ring
---------------

The kernel fills in the refill ring fields of the registration ``struct
io_uring_zcrx_ifq_reg``. Map it into userspace::

  struct io_uring_zcrx_rq refill_ring;

  refill_ring.khead = (unsigned *)((char *)ring_ptr + reg.offsets.head);
  refill_ring.ktail = (unsigned *)((char *)ring_ptr + reg.offsets.tail);
  refill_ring.rqes =
          (struct io_uring_zcrx_rqe *)((char *)ring_ptr + reg.offsets.rqes);
  /* use rq_entries as (possibly) updated by the kernel during registration */
  refill_ring.ring_entries = reg.rq_entries;
  refill_ring.rq_tail = 0;
  refill_ring.ring_ptr = ring_ptr;

Receiving data
--------------

Prepare a zero copy recv request::

  struct io_uring_sqe *sqe;

  sqe = io_uring_get_sqe(ring);
  io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, fd, NULL, 0, 0);
  sqe->ioprio |= IORING_RECV_MULTISHOT;

Now, submit and wait::

  io_uring_submit_and_wait(ring, 1);

Finally, process completions::

  struct io_uring_cqe *cqe;
  unsigned int count = 0;
  unsigned int head;

  io_uring_for_each_cqe(ring, head, cqe) {
          struct io_uring_zcrx_cqe *rcqe = (struct io_uring_zcrx_cqe *)(cqe + 1);

          unsigned long mask = (1ULL << IORING_ZCRX_AREA_SHIFT) - 1;
          unsigned char *data = (unsigned char *)area_ptr + (rcqe->off & mask);
          /* do something with the data */

          count++;
  }
  io_uring_cq_advance(ring, count);

Recycling buffers
-----------------

Return buffers back to the kernel to be used again::

  struct io_uring_zcrx_rqe *rqe;
  unsigned mask = refill_ring.ring_entries - 1;
  rqe = &refill_ring.rqes[refill_ring.rq_tail & mask];

  unsigned long area_offset = rcqe->off & ~IORING_ZCRX_AREA_MASK;
  rqe->off = area_offset | area_reg.rq_area_token;
  rqe->len = cqe->res;
  IO_URING_WRITE_ONCE(*refill_ring.ktail, ++refill_ring.rq_tail);

Testing
=======

See ``tools/testing/selftests/drivers/net/hw/iou-zcrx.c``
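For reference, the receive and recycle steps above combine into an event loop
along the following lines. This is a condensed, illustrative sketch only; it
reuses ``ring``, ``area_ptr``, ``area_reg`` and ``refill_ring`` from the
snippets above and omits error handling. The selftest referenced above is the
complete example::

  int done = 0;

  while (!done) {
          struct io_uring_cqe *cqe;
          unsigned int count = 0, head;
          unsigned int mask = refill_ring.ring_entries - 1;

          io_uring_submit_and_wait(ring, 1);

          io_uring_for_each_cqe(ring, head, cqe) {
                  struct io_uring_zcrx_cqe *rcqe =
                          (struct io_uring_zcrx_cqe *)(cqe + 1);
                  struct io_uring_zcrx_rqe *rqe;

                  count++;
                  if (cqe->res <= 0) {
                          /* error or EOF */
                          done = 1;
                          break;
                  }

                  /* consume cqe->res bytes of payload found at
                   * area_ptr + (rcqe->off & ~IORING_ZCRX_AREA_MASK),
                   * then return the buffer via the refill ring
                   */
                  rqe = &refill_ring.rqes[refill_ring.rq_tail++ & mask];
                  rqe->off = (rcqe->off & ~IORING_ZCRX_AREA_MASK) |
                             area_reg.rq_area_token;
                  rqe->len = cqe->res;
          }

          /* publish recycled buffers, then release the completions */
          IO_URING_WRITE_ONCE(*refill_ring.ktail, refill_ring.rq_tail);
          io_uring_cq_advance(ring, count);
  }

Note that a completion without ``IORING_CQE_F_MORE`` set indicates that the
multishot request has terminated and needs to be re-armed.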