.. SPDX-License-Identifier: GPL-2.0

=================
Device Memory TCP
=================


Intro
=====

Device memory TCP (devmem TCP) enables receiving data directly into device
memory (dmabuf). The feature is currently implemented for TCP sockets.


Opportunity
-----------

A large number of data transfers have device memory as the source and/or
destination. Accelerators drastically increased the prevalence of such
transfers. Some examples include:

- Distributed training, where ML accelerators, such as GPUs on different hosts,
  exchange data.

- Distributed raw block storage applications transfer large amounts of data with
  remote SSDs. Much of this data does not require host processing.

Typically the Device-to-Device data transfers in the network are implemented as
the following low-level operations: Device-to-Host copy, Host-to-Host network
transfer, and Host-to-Device copy.

The flow involving host copies is suboptimal, especially for bulk data transfers,
and can put significant strains on system resources such as host memory
bandwidth and PCIe bandwidth.

Devmem TCP optimizes this use case by implementing socket APIs that enable
the user to receive incoming network packets directly into device memory.

Packet payloads go directly from the NIC to device memory.

Packet headers go to host memory and are processed by the TCP/IP stack
normally. The NIC must support header split to achieve this.

Advantages:

- Alleviate host memory bandwidth pressure, compared to existing
  network-transfer + device-copy semantics.

- Alleviate PCIe bandwidth pressure, by limiting data transfer to the lowest
  level of the PCIe tree, compared to the traditional path which sends data
  through the root complex.


More Info
---------

  slides, video
    https://netdevconf.org/0x17/sessions/talk/device-memory-tcp.html

  patchset
    [PATCH net-next v24 00/13] Device Memory TCP
    https://lore.kernel.org/netdev/20240831004313.3713467-1-almasrymina@google.com/


RX Interface
============


Example
-------

./tools/testing/selftests/drivers/net/hw/ncdevmem:do_server shows an example of
setting up the RX path of this API.


NIC Setup
---------

Header split, flow steering, & RSS are required features for devmem TCP.

Header split is used to split incoming packets into a header buffer in host
memory, and a payload buffer in device memory.

Flow steering & RSS are used to ensure that only flows targeting devmem land on
an RX queue bound to devmem.
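
Whether the NIC and driver expose these features can be checked with ethtool
before configuring them. The exact output depends on the driver and the
ethtool version, so treat the commands below only as a rough sanity check::

        # header split support and its current state are reported with the
        # ring parameters on recent ethtool versions
        ethtool -g eth1

        # flow steering support shows up as the ntuple-filters feature
        ethtool -k eth1 | grep ntuple-filters
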
Enable header split & flow steering::

        # enable header split
        ethtool -G eth1 tcp-data-split on


        # enable flow steering
        ethtool -K eth1 ntuple on

Configure RSS to steer all traffic away from the target RX queue (queue 15 in
this example)::

        ethtool --set-rxfh-indir eth1 equal 15


The user must bind a dmabuf to any number of RX queues on a given NIC using
the netlink API::

        /* Bind dmabuf to NIC RX queue 15 */
        struct netdev_bind_rx_req *req = NULL;
        struct netdev_bind_rx_rsp *rsp = NULL;
        struct netdev_queue *queues;
        struct ynl_error yerr;

        queues = malloc(sizeof(*queues) * 1);

        queues[0]._present.type = 1;
        queues[0]._present.idx = 1;
        queues[0].type = NETDEV_RX_QUEUE_TYPE_RX;
        queues[0].idx = 15;

        *ys = ynl_sock_create(&ynl_netdev_family, &yerr);

        req = netdev_bind_rx_req_alloc();
        netdev_bind_rx_req_set_ifindex(req, 1 /* ifindex */);
        netdev_bind_rx_req_set_dmabuf_fd(req, dmabuf_fd);
        __netdev_bind_rx_req_set_queues(req, queues, n_queue_index);

        rsp = netdev_bind_rx(*ys, req);

        dmabuf_id = rsp->dmabuf_id;


The netlink API returns a dmabuf_id: a unique ID that refers to the bound
dmabuf.

The user can unbind the dmabuf from the netdevice by closing the netlink socket
that established the binding. We do this so that the binding is automatically
unbound even if the userspace process crashes.

Note that any reasonably well-behaved dmabuf from any exporter should work with
devmem TCP, even if the dmabuf is not actually backed by devmem. An example of
this is udmabuf, which wraps user memory (non-devmem) in a dmabuf.


Socket Setup
------------

The socket must be flow steered to the dmabuf bound RX queue::

        ethtool -N eth1 flow-type tcp4 ... queue 15


Receiving data
--------------

The user application must signal to the kernel that it is capable of receiving
devmem data by passing the MSG_SOCK_DEVMEM flag to recvmsg::

        ret = recvmsg(fd, &msg, MSG_SOCK_DEVMEM);

Applications that do not specify the MSG_SOCK_DEVMEM flag will receive an EFAULT
on devmem data.
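
The msg passed to the recvmsg() call above is an ordinary msghdr; only the flag
is new. A minimal sketch of its setup could look as follows (the buffer names
and sizes here are arbitrary, and the iovec only receives payload that ends up
in host memory)::

        char iobuf[64 * 1024];  /* receives payload that lands in host memory */
        /* room for a number of SCM_DEVMEM_* cmsgs, one per received frag */
        char ctrl_data[CMSG_SPACE(sizeof(struct dmabuf_cmsg)) * 128];
        struct iovec iov = {
                .iov_base = iobuf,
                .iov_len = sizeof(iobuf),
        };
        struct msghdr msg = {
                .msg_iov = &iov,
                .msg_iovlen = 1,
                .msg_control = ctrl_data,
                .msg_controllen = sizeof(ctrl_data),
        };

        ret = recvmsg(fd, &msg, MSG_SOCK_DEVMEM);
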
Devmem data is received directly into the dmabuf bound to the NIC in 'NIC
Setup', and the kernel signals this to the user via the SCM_DEVMEM_* cmsgs::

        for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                if (cm->cmsg_level != SOL_SOCKET ||
                    (cm->cmsg_type != SCM_DEVMEM_DMABUF &&
                     cm->cmsg_type != SCM_DEVMEM_LINEAR))
                        continue;

                dmabuf_cmsg = (struct dmabuf_cmsg *)CMSG_DATA(cm);

                if (cm->cmsg_type == SCM_DEVMEM_DMABUF) {
                        /* Frag landed in dmabuf.
                         *
                         * dmabuf_cmsg->dmabuf_id is the dmabuf the
                         * frag landed on.
                         *
                         * dmabuf_cmsg->frag_offset is the offset into
                         * the dmabuf where the frag starts.
                         *
                         * dmabuf_cmsg->frag_size is the size of the
                         * frag.
                         *
                         * dmabuf_cmsg->frag_token is a token used to
                         * refer to this frag for later freeing.
                         */

                        struct dmabuf_token token;
                        token.token_start = dmabuf_cmsg->frag_token;
                        token.token_count = 1;
                        continue;
                }

                if (cm->cmsg_type == SCM_DEVMEM_LINEAR)
                        /* Frag landed in linear buffer.
                         *
                         * dmabuf_cmsg->frag_size is the size of the
                         * frag.
                         */
                        continue;
        }

Applications may receive 2 cmsgs:

- SCM_DEVMEM_DMABUF: this indicates the fragment landed in the dmabuf indicated
  by dmabuf_id.

- SCM_DEVMEM_LINEAR: this indicates the fragment landed in the linear buffer.
  This typically happens when the NIC is unable to split the packet at the
  header boundary, such that part (or all) of the payload landed in host
  memory.

Applications may receive no SCM_DEVMEM_* cmsgs. That indicates non-devmem,
regular TCP data that landed on an RX queue not bound to a dmabuf.


Freeing frags
-------------

Frags received via SCM_DEVMEM_DMABUF are pinned by the kernel while the user
processes the frag. The user must return the frag to the kernel via
SO_DEVMEM_DONTNEED::

        ret = setsockopt(client_fd, SOL_SOCKET, SO_DEVMEM_DONTNEED, &token,
                         sizeof(token));

The user must ensure the tokens are returned to the kernel in a timely manner.
Failure to do so will exhaust the limited dmabuf that is bound to the RX queue
and will lead to packet drops.

The user must pass no more than 128 tokens, with no more than 1024 total frags
summed across the token_count of all passed tokens. If the user provides more
than 1024 frags, the kernel will free up to 1024 frags and return early.

The kernel returns the number of frags actually freed. The number of frags
freed can be less than the tokens provided by the user in case of:

(a) an internal kernel leak bug.
(b) the user passed more than 1024 frags.
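
Since every received frag carries its own token, applications typically batch
the returns instead of issuing one setsockopt() per frag. A rough sketch of
such batching (the array size is chosen to stay within the 128-token limit;
variable names and error handling are illustrative only)::

        struct dmabuf_token tokens[128];
        int num_tokens = 0;
        int ret;

        /* For every SCM_DEVMEM_DMABUF cmsg processed above, record the frag's
         * token instead of freeing it immediately.
         */
        tokens[num_tokens].token_start = dmabuf_cmsg->frag_token;
        tokens[num_tokens].token_count = 1;
        num_tokens++;

        /* Once the batch is full, or the data has been consumed, return all
         * frags to the kernel in a single call.
         */
        ret = setsockopt(client_fd, SOL_SOCKET, SO_DEVMEM_DONTNEED, tokens,
                         sizeof(tokens[0]) * num_tokens);
        if (ret != num_tokens)
                fprintf(stderr, "not all frags were freed: %d\n", ret);
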
TX Interface
============


Example
-------

./tools/testing/selftests/drivers/net/hw/ncdevmem:do_client shows an example of
setting up the TX path of this API.


NIC Setup
---------

The user must bind a TX dmabuf to a given NIC using the netlink API::

        struct netdev_bind_tx_req *req = NULL;
        struct netdev_bind_tx_rsp *rsp = NULL;
        struct ynl_error yerr;

        *ys = ynl_sock_create(&ynl_netdev_family, &yerr);

        req = netdev_bind_tx_req_alloc();
        netdev_bind_tx_req_set_ifindex(req, ifindex);
        netdev_bind_tx_req_set_fd(req, dmabuf_fd);

        rsp = netdev_bind_tx(*ys, req);

        tx_dmabuf_id = rsp->id;


The netlink API returns a dmabuf_id: a unique ID that refers to the bound
dmabuf.

The user can unbind the dmabuf from the netdevice by closing the netlink socket
that established the binding. We do this so that the binding is automatically
unbound even if the userspace process crashes.

Note that any reasonably well-behaved dmabuf from any exporter should work with
devmem TCP, even if the dmabuf is not actually backed by devmem. An example of
this is udmabuf, which wraps user memory (non-devmem) in a dmabuf.

Socket Setup
------------

The user application must use the MSG_ZEROCOPY flag when sending devmem TCP.
Devmem cannot be copied by the kernel, so the semantics of devmem TX are
similar to the semantics of MSG_ZEROCOPY::

        setsockopt(socket_fd, SOL_SOCKET, SO_ZEROCOPY, &opt, sizeof(opt));

It is also recommended that the user bind the TX socket to the same interface
the dma-buf has been bound to via SO_BINDTODEVICE::

        setsockopt(socket_fd, SOL_SOCKET, SO_BINDTODEVICE, ifname, strlen(ifname) + 1);


Sending data
------------

Devmem data is sent using the SCM_DEVMEM_DMABUF cmsg.

The user should create a msghdr where:

* iov_base is set to the offset into the dmabuf to start sending from
* iov_len is set to the number of bytes to be sent from the dmabuf

The user passes the dma-buf id to send from via dmabuf_tx_cmsg.dmabuf_id.

The example below sends 1024 bytes from offset 100 into the dmabuf, and 2048
bytes from offset 2000 into the dmabuf. The dmabuf to send from is
tx_dmabuf_id::

        char ctrl_data[CMSG_SPACE(sizeof(struct dmabuf_tx_cmsg))];
        struct dmabuf_tx_cmsg ddmabuf;
        struct msghdr msg = {};
        struct cmsghdr *cmsg;
        struct iovec iov[2];

        iov[0].iov_base = (void*)100;
        iov[0].iov_len = 1024;
        iov[1].iov_base = (void*)2000;
        iov[1].iov_len = 2048;

        msg.msg_iov = iov;
        msg.msg_iovlen = 2;

        msg.msg_control = ctrl_data;
        msg.msg_controllen = sizeof(ctrl_data);

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_DEVMEM_DMABUF;
        cmsg->cmsg_len = CMSG_LEN(sizeof(struct dmabuf_tx_cmsg));

        ddmabuf.dmabuf_id = tx_dmabuf_id;

        *((struct dmabuf_tx_cmsg *)CMSG_DATA(cmsg)) = ddmabuf;

        sendmsg(socket_fd, &msg, MSG_ZEROCOPY);


Reusing TX dmabufs
------------------

Similar to MSG_ZEROCOPY with regular memory, the user should not modify the
contents of the dma-buf while a send operation is in progress. This is because
the kernel does not keep a copy of the dmabuf contents. Instead, the kernel
will pin and send data from the buffer available to the userspace.

Just as in MSG_ZEROCOPY, the kernel notifies the userspace of send completions
using MSG_ERRQUEUE::

        int64_t tstop = gettimeofday_ms() + waittime_ms;
        char control[CMSG_SPACE(100)] = {};
        struct sock_extended_err *serr;
        struct msghdr msg = {};
        struct cmsghdr *cm;
        int retries = 10;
        __u32 hi, lo;

        msg.msg_control = control;
        msg.msg_controllen = sizeof(control);

        while (gettimeofday_ms() < tstop) {
                if (!do_poll(fd)) continue;

                ret = recvmsg(fd, &msg, MSG_ERRQUEUE);

                for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                        serr = (void *)CMSG_DATA(cm);

                        hi = serr->ee_data;
                        lo = serr->ee_info;

                        fprintf(stdout, "tx complete [%d,%d]\n", lo, hi);
                }
        }

After the associated sendmsg has been completed, the dmabuf can be reused by
the userspace.


Implementation & Caveats
========================

Unreadable skbs
---------------

Devmem payloads are inaccessible to the kernel processing the packets. This
results in a few quirks for payloads of devmem skbs:

- Loopback is not functional. Loopback relies on copying the payload, which is
  not possible with devmem skbs.

- Software checksum calculation fails.

- tcpdump and BPF can't access devmem packet payloads.


Testing
=======

More realistic example code can be found in the kernel source under
``tools/testing/selftests/drivers/net/hw/ncdevmem.c``

ncdevmem is a devmem TCP netcat. It works very similarly to netcat, but
receives data directly into a udmabuf.

To run ncdevmem, you need to run it as a server on the machine under test, and
you need to run netcat on a peer to provide the TX data.

ncdevmem also has a validation mode that expects incoming data to follow a
repeating pattern and validates it. For example, you can launch
ncdevmem on the server by::

        ncdevmem -s <server IP> -c <client IP> -f <ifname> -l -p 5201 -v 7

On the client side, use regular netcat to send TX data to the ncdevmem process
on the server::

        yes $(echo -e \\x01\\x02\\x03\\x04\\x05\\x06) | \
                tr \\n \\0 | head -c 5G | nc <server IP> 5201 -p 5201