1 /* SPDX-License-Identifier: MIT */
16 * bind engine, and return a handle to the user.
19 * ------------
29 * in / out fence interface (struct drm_xe_sync) as execs, which allows users to
30 * think of binds and execs as more or less the same operation.
33 * ----------
35 * DRM_XE_VM_BIND_OP_MAP - Create mapping for a BO
36 * DRM_XE_VM_BIND_OP_UNMAP - Destroy mapping for a BO / userptr
37 * DRM_XE_VM_BIND_OP_MAP_USERPTR - Create mapping for userptr
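 *
 * As a rough illustration, a single map of a BO could be submitted from
 * userspace as in the sketch below. Struct, field, and ioctl names follow the
 * Xe uAPI header (xe_drm.h) and may differ between uAPI revisions; other
 * fields a given revision requires (e.g. a bind engine / exec queue, PAT
 * index) and all error handling are omitted.
 *
 * .. code-block:: c
 *
 *   #include <stdint.h>
 *   #include <sys/ioctl.h>
 *   #include <xe_drm.h>   // include path depends on the libdrm install
 *
 *   // Map @size bytes of BO @bo_handle at GPU VA @gpu_addr in VM @vm_id
 *   static int bind_bo(int fd, uint32_t vm_id, uint32_t bo_handle,
 *                      uint64_t bo_offset, uint64_t gpu_addr, uint64_t size)
 *   {
 *       struct drm_xe_vm_bind bind = { 0 };
 *
 *       bind.vm_id = vm_id;
 *       bind.num_binds = 1;
 *       bind.bind.op = DRM_XE_VM_BIND_OP_MAP;
 *       bind.bind.obj = bo_handle;
 *       bind.bind.obj_offset = bo_offset;
 *       bind.bind.addr = gpu_addr;          // GPU virtual address
 *       bind.bind.range = size;
 *       bind.num_syncs = 0;                 // no in / out fences in this sketch
 *
 *       return ioctl(fd, DRM_IOCTL_XE_VM_BIND, &bind);
 *   }
 *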
43 * and GPU to modify page tables. If a new physical page is allocated in the
44 * page table structure, we populate that page via the CPU and insert that new
46 * pages in the page table structure that need to be modified are also updated
48 * GPU job will always have at least 1 update. The in / out fences are passed to
52 * and the resulting operations:
54 * .. code-block::
56 * bind BO0 0x0-0x1000
62 * bind BO1 0x201000-0x202000
66 * bind BO2 0x1ff000-0x201000
73 * In the above example the steps using the GPU can be converted to CPU if the
74 * bind can be done immediately (all in-fences satisfied, VM dma-resv kernel
78 * -------------
83 * ----------
85 * The minimum page size is either 4k or 64k depending on platform and memory
89 * Larger pages (2M or 1GB) can be used for BOs in VRAM if the BO physical address
90 * is aligned to the larger page size and the VA is aligned to the larger page
91 * size. Larger pages for userptrs / BOs in sysmem should be possible but are not
95 * ------------------------
97 * In both modes during the bind IOCTL the user input is validated. In sync
100 * CPU and the job to do the GPU binds is created in the IOCTL itself. This step
101 * can fail due to memory pressure. The user can recover by freeing memory and
105 * -------------------------
107 * In async error handling the step of validating the BO, updating page tables,
108 * and generating a job is deferred to an async worker. As this step can now
122 * ---------------------
124 * Think of the case where we have two bind operations, A + B, submitted
125 * in that order. A has in fences while B has none. If using a single bind
126 * queue, B is now blocked on A's in fences even though it is ready to run. This
127 * example is a real use case for VK sparse binding. We work around this
130 * In the bind IOCTL the user can optionally pass in an engine ID which must map
134 * engine's ring. In the example above, if A and B have different bind engines, B
138 * TODO: Explain race in issue 41 and how we solve it
141 * ------------------------
143 * The uAPI allows multiple bind operations to be passed in via a user array
144 * of struct drm_xe_vm_bind_op, in a single VM bind IOCTL. This interface
146 * the array into a list of operations, pass the in fences to the first operation,
147 * and pass the out fences to the last operation. The ordered nature of a bind
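 *
 * As a hedged sketch (uAPI names per xe_drm.h and subject to revision;
 * includes as in the earlier MAP sketch, syncobj creation not shown), two map
 * operations with a single in fence and a single out fence could be submitted
 * in one IOCTL as:
 *
 * .. code-block:: c
 *
 *   static int bind_two(int fd, uint32_t vm_id, uint32_t bo0, uint32_t bo1,
 *                       uint32_t in_syncobj, uint32_t out_syncobj)
 *   {
 *       struct drm_xe_vm_bind_op ops[2] = {
 *           { .op = DRM_XE_VM_BIND_OP_MAP, .obj = bo0,
 *             .addr = 0x0, .range = 0x1000 },
 *           { .op = DRM_XE_VM_BIND_OP_MAP, .obj = bo1,
 *             .addr = 0x201000, .range = 0x1000 },
 *       };
 *       struct drm_xe_sync syncs[2] = {
 *           // in fence: gates the first operation
 *           { .type = DRM_XE_SYNC_TYPE_SYNCOBJ, .handle = in_syncobj },
 *           // out fence: signaled after the last operation
 *           { .type = DRM_XE_SYNC_TYPE_SYNCOBJ, .handle = out_syncobj,
 *             .flags = DRM_XE_SYNC_FLAG_SIGNAL },
 *       };
 *       struct drm_xe_vm_bind bind = { 0 };
 *
 *       bind.vm_id = vm_id;
 *       bind.num_binds = 2;
 *       bind.vector_of_binds = (uintptr_t)ops;   // user array of ops
 *       bind.num_syncs = 2;
 *       bind.syncs = (uintptr_t)syncs;
 *
 *       return ioctl(fd, DRM_IOCTL_XE_VM_BIND, &bind);
 *   }
 *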
151 * ----------------------------
155 * .. code-block::
157 * 0x0000-0x2000 and 0x3000-0x5000 have mappings
158 * Munmap 0x1000-0x4000 results in mappings 0x0000-0x1000 and 0x4000-0x5000
160 * To support this semantic, we decompose the above example
163 * .. code-block::
165 * unbind 0x0000-0x2000
166 * unbind 0x3000-0x5000
167 * rebind 0x0000-0x1000
168 * rebind 0x4000-0x5000
170 * Why not just do a partial unbind of 0x1000-0x2000 and 0x3000-0x4000? This
171 * falls apart when using large pages at the edges and the unbind forces us to
173 * unmapping anything in the range and at most 2 rebinds on the edges.
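 *
 * The decomposition itself is mechanical. A purely illustrative, self-contained
 * sketch (not driver code) of computing the operations for a munmap style
 * unbind over a set of existing mappings:
 *
 * .. code-block:: c
 *
 *   #include <stdio.h>
 *
 *   struct mapping { unsigned long start, end; };
 *
 *   // Unbind every mapping touched by [start, end) in full, then rebind the
 *   // surviving pieces outside the range (at most 2, on the edges).
 *   static void munmap_ops(const struct mapping *maps, int n,
 *                          unsigned long start, unsigned long end)
 *   {
 *       for (int i = 0; i < n; i++) {
 *           if (maps[i].end <= start || maps[i].start >= end)
 *               continue;   // mapping not touched by the munmap
 *           printf("unbind 0x%lx-0x%lx\n", maps[i].start, maps[i].end);
 *           if (maps[i].start < start)   // left edge survives
 *               printf("rebind 0x%lx-0x%lx\n", maps[i].start, start);
 *           if (maps[i].end > end)       // right edge survives
 *               printf("rebind 0x%lx-0x%lx\n", end, maps[i].end);
 *       }
 *   }
 *
 *   int main(void)
 *   {
 *       struct mapping maps[] = { { 0x0000, 0x2000 }, { 0x3000, 0x5000 } };
 *
 *       munmap_ops(maps, 2, 0x1000, 0x4000);   // the example above
 *       return 0;
 *   }
 *
 * The listing above groups the unbinds before the rebinds; the sketch only
 * shows how the ranges are derived.
 *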
175 * Similar to an array of binds, in fences are passed to the first operation and
178 * In this example there is a window of time where 0x0000-0x1000 and
179 * 0x4000-0x5000 are invalid but the user didn't ask for these addresses to be
180 * removed from the mapping. To work around this we treat any munmap style
184 * complete / triggers preempt fences) and the last operation is installed in
186 * VM). The caveat is that all dma-resv slots must be updated atomically with respect
187 * to execs and the compute mode rebind worker. To accomplish this, hold the
188 * vm->lock in write mode from the first operation until the last.
190 * Deferred binds in fault mode
191 * ----------------------------
193 * If a VM is in fault mode (TODO: link to fault mode), new bind operations that
203 * user wants to create a GPU mapping. Typically in other DRM drivers a dummy BO
204 * is created and then a binding is created. We bypass creating a dummy BO in
205 * XE and simply create a binding directly from the userptr.
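 *
 * A hedged sketch of such a userptr bind (includes as in the earlier MAP
 * sketch; per the uAPI header the user pointer travels in @obj_offset with
 * @obj set to 0, though details may differ between revisions):
 *
 * .. code-block:: c
 *
 *   // Map @size bytes of anonymous memory at @ptr to GPU VA @gpu_addr
 *   static int bind_userptr(int fd, uint32_t vm_id, void *ptr,
 *                           uint64_t gpu_addr, uint64_t size)
 *   {
 *       struct drm_xe_vm_bind bind = { 0 };
 *
 *       bind.vm_id = vm_id;
 *       bind.num_binds = 1;
 *       bind.bind.op = DRM_XE_VM_BIND_OP_MAP_USERPTR;
 *       bind.bind.obj = 0;                        // no BO backs this VMA
 *       bind.bind.obj_offset = (uintptr_t)ptr;    // the user pointer
 *       bind.bind.addr = gpu_addr;
 *       bind.bind.range = size;
 *
 *       return ioctl(fd, DRM_IOCTL_XE_VM_BIND, &bind);
 *   }
 *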
208 * ------------
214 * idle to ensure no faults. This is done by waiting on all of the VM's dma-resv slots.
217 * -------
219 * Either the next exec (non-compute) or rebind worker (compute mode) will
221 * after the VM dma-resv wait if the VM is in compute mode.
226 * A VM in compute mode enables long running workloads and ultra low latency
229 * into the continuously running batch. In both cases these batches exceed the
231 * are not used when a VM is in compute mode. User fences (TODO: link user fence
235 * --------------
237 * If the kernel decides to move memory around (either userptr invalidate, BO
238 * eviction, or munmap style unbind which results in a rebind) and a batch is
240 * page tables for the moved memory are no longer valid. To work around this we
243 * hardware and the preempt fence signals when the engine is off the hardware.
245 * memory and kick the rebind worker which resumes all the engines' execution.
248 * dma-resv DMA_RESV_USAGE_PREEMPT_FENCE slot. The same preempt fence, for every
249 * engine using the VM, is also installed into the same dma-resv slot of every
250 * external BO mapped in the VM.
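 *
 * A kernel-side sketch of the fence installation described above.
 * dma_resv_add_fence() is the real dma-resv API; the external BO iterator is
 * hypothetical and the DMA_RESV_USAGE_PREEMPT_FENCE usage is the one named by
 * this document. All relevant dma-resv locks are assumed to be held and fence
 * slots already reserved.
 *
 * .. code-block:: c
 *
 *   static void install_preempt_fence(struct xe_vm *vm,
 *                                     struct dma_fence *fence)
 *   {
 *       struct xe_bo *bo;
 *
 *       // VM dma-resv slot, shared with all private BOs in the VM
 *       dma_resv_add_fence(vm->ttm.base.resv, fence,
 *                          DMA_RESV_USAGE_PREEMPT_FENCE);
 *
 *       // Same fence into every external BO mapped in the VM
 *       for_each_external_bo(vm, bo)   // hypothetical iterator
 *           dma_resv_add_fence(bo->ttm.base.resv, fence,
 *                              DMA_RESV_USAGE_PREEMPT_FENCE);
 *   }
 *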
253 * -------------
257 * fences, and finally resuming execution of engines in the VM.
262 * .. code-block::
264 * <----------------------------------------------------------------------|
266 * Lock VM global lock in read mode |
268 * Lock VM dma-resv and external BOs dma-resv |
270 * Wait on and allocate new preempt fences for every engine using the VM |
273 * Wait VM's DMA_RESV_USAGE_KERNEL dma-resv slot |
274 * Install preempt fences and issue resume for every engine using the VM |
278 * Wait all VM's dma-resv slots |
279 * Retry ----------------------------------------------------------
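 *
 * A condensed, purely illustrative sketch of the flow above; every helper
 * below is hypothetical and stands in for the real implementation:
 *
 * .. code-block:: c
 *
 *   static void rebind_worker(struct xe_vm *vm)
 *   {
 *   retry:
 *       down_read(&vm->lock);               // VM global lock, read mode
 *       lock_vm_and_external_bo_resvs(vm);  // WW transaction over dma-resvs
 *
 *       wait_and_alloc_preempt_fences(vm);  // one per engine using the VM
 *       // ... validate / rebind as needed (elided) ...
 *       wait_vm_kernel_slot(vm);            // DMA_RESV_USAGE_KERNEL slot
 *       install_preempt_fences_and_resume(vm);
 *
 *       unlock_vm_and_external_bo_resvs(vm);
 *       up_read(&vm->lock);
 *
 *       if (wait_all_vm_resv_slots(vm))     // something changed? start over
 *           goto retry;
 *   }
 *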
284 * -----------
286 * In order to prevent an engine from continuously being kicked off the hardware
287 * and making no forward progress, an engine has a period of time it is allowed to
294 * If a GT has slower access to some regions and the page table structure is in
296 * work around this we allow a VM's page tables to be shadowed in multiple GTs.
297 * When a VM is created, a default bind engine and page table structure are created
300 * Binds can optionally pass in a mask of GTs where a mapping should be created,
304 * The implementation for this breaks down into a bunch of for_each_gt loops in
305 * various places plus exporting a composite fence for multi-GT binds to the
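 *
 * For the composite fence, dma_fence_array_create() is the real API; the
 * sketch below is illustrative and assumes the per-GT bind fences have been
 * collected into a kmalloc'ed array (ownership of which the fence array
 * takes over):
 *
 * .. code-block:: c
 *
 *   #include <linux/dma-fence-array.h>
 *
 *   static struct dma_fence *bind_composite_fence(struct dma_fence **fences,
 *                                                 int num_gts)
 *   {
 *       struct dma_fence_array *array;
 *
 *       // Signals only once every per-GT bind fence has signaled
 *       array = dma_fence_array_create(num_gts, fences,
 *                                      dma_fence_context_alloc(1), 1,
 *                                      false);
 *       if (!array)
 *           return NULL;
 *
 *       return &array->base;
 *   }
 *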
311 * A VM in fault mode can be enabled on devices that support page faults. If
314 * signaling, and memory allocation is usually required to resolve a page
316 * such, dma-fences are not allowed when a VM is in fault mode. Because dma-fences
317 * are not allowed, long running workloads and ULLS are enabled on a faulting
321 * ----------------
323 * By default, on a faulting VM, binds just allocate the VMA and the actual
325 * behavior can be overridden by setting the flag DRM_XE_VM_BIND_FLAG_IMMEDIATE in
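 *
 * For example (hedged, uAPI names per xe_drm.h), a bind on a faulting VM that
 * should be resident before the first access could reuse the earlier MAP
 * sketch with the flag set:
 *
 * .. code-block:: c
 *
 *   // As in bind_bo() earlier, but populate the mapping at bind time
 *   bind.bind.op = DRM_XE_VM_BIND_OP_MAP;
 *   bind.bind.flags = DRM_XE_VM_BIND_FLAG_IMMEDIATE;
 *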
329 * ------------------
331 * Page faults are received in the G2H worker under the CT lock which is in the
334 * is that faults issue TLB invalidations, which require G2H credits, and we cannot
335 * allocate G2H credits in the G2H handlers without deadlocking. Lastly, we do
339 * To work around the above issue with processing faults in the G2H worker, we
341 * the GT (1 per hardware engine) and kick a worker to process the faults. Since
342 * the page fault G2Hs are already received in a worker, kicking another worker
343 * adds more latency to a critical performance path. We add a fast path in the
344 * G2H irq handler which looks at the first G2H and if it is a page fault we sink
345 * the fault to the buffer and kick the worker to process the fault. TLB
346 * invalidation responses are also in the critical path so these can also be
347 * processed in this fast path.
349 * Multiple buffers and workers are used, selected by hashing the ASID, so
350 * faults from different VMs can be processed in parallel.
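 *
 * Purely as an illustration of the hashing, selecting a fault buffer / worker
 * could look like (constants and names hypothetical):
 *
 * .. code-block:: c
 *
 *   #define NUM_PF_QUEUES 4   // hypothetical number of buffers / workers
 *
 *   static unsigned int pf_queue_index(unsigned int asid)
 *   {
 *       // Faults from different VMs (different ASIDs) spread over workers
 *       return asid % NUM_PF_QUEUES;
 *   }
 *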
354 * .. code-block::
356 * Lookup VM from ASID in page fault G2H
357 * Lock VM global lock in read mode
358 * Lookup VMA from address in page fault G2H
361 * <----------------------------------------------------------------------|
363 * Lock VM & BO dma-resv locks |
368 * Drop VM & BO dma-resv locks |
369 * Retry ----------------------------------------------------------
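 *
 * A condensed sketch of the handler loop above; all helpers are hypothetical
 * stand-ins for the real implementation:
 *
 * .. code-block:: c
 *
 *   static int handle_pagefault(struct xe_device *xe, u32 asid, u64 addr)
 *   {
 *       struct xe_vm *vm = lookup_vm_from_asid(xe, asid);
 *       struct xe_vma *vma;
 *       int err;
 *
 *       down_read(&vm->lock);            // VM global lock, read mode
 *       vma = lookup_vma(vm, addr);
 *
 *   retry:
 *       lock_vm_and_bo_resvs(vm, vma);
 *       err = validate_and_rebind(vma);  // may move / fault in memory
 *       unlock_vm_and_bo_resvs(vm, vma);
 *       if (err == -EAGAIN)
 *           goto retry;
 *
 *       up_read(&vm->lock);
 *       return err;
 *   }
 *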
375 * ---------------
378 * accessing VMAs in system memory frequently as a hint to migrate those VMAs to
384 * simply drop the G2H. Access counters are a best case optimization and it is
389 * .. code-block::
391 * Lookup VM from ASID in access counter G2H
392 * Lock VM global lock in read mode
393 * Lookup VMA from address in access counter G2H
395 * Lock VM & BO dma-resv locks
399 * Notice no rebind is issued in the access counter handler as the rebind will
403 * ------------------------------------------------
405 * In the case of eviction and user pointer invalidation on a faulting VM, there
407 * for the VMAs and the page fault handler will rebind the VMAs when they fault.
409 * needed. In both the eviction and user pointer invalidation cases, locks are
410 * held which make acquiring the VM global lock impossible. To work around this
414 * kernel to move the VMA's memory around. This is a necessary lockless
415 * algorithm and is safe as leaves cannot be changed while either an eviction or
422 * evictions, and compute mode rebind worker) in XE.
425 * -----
427 * VM global lock (vm->lock) - rw semaphore lock. Outermost lock which protects
428 * the list of userptrs mapped in the VM, the list of engines using this VM, and
429 * the array of external BOs mapped in the VM. Any path adding or removing any of the
430 * aforementioned state from the VM should acquire this lock in write mode. The VM
431 * bind path also acquires this lock in write mode, while the exec / compute mode
432 * rebind worker acquires this lock in read mode.
434 * VM dma-resv lock (vm->ttm.base.resv->lock) - WW lock. Protects VM dma-resv
435 * slots, which are shared with any private BO in the VM. Expected to be acquired
436 * during VM binds, execs, and the compute mode rebind worker. This lock is also
439 * External BO dma-resv lock (bo->ttm.base.resv->lock) - WW lock. Protects
440 * external BO dma-resv slots. Expected to be acquired during VM binds (in
441 * addition to the VM dma-resv lock). All external BO dma-resv locks within a VM are
442 * expected to be acquired (in addition to the VM dma-resv lock) during execs
443 * and the compute mode rebind worker. This lock is also held when an external
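 *
 * A lock ordering sketch for an exec / rebind worker style path. down_read()
 * and dma_resv_lock() are the real primitives; the external BO iterator is
 * hypothetical and WW -EDEADLK backoff handling is intentionally omitted:
 *
 * .. code-block:: c
 *
 *   static void exec_lock_order_sketch(struct xe_vm *vm,
 *                                      struct ww_acquire_ctx *ww)
 *   {
 *       struct xe_bo *bo;
 *
 *       down_read(&vm->lock);                      // 1: VM global lock
 *       dma_resv_lock(vm->ttm.base.resv, ww);      // 2: VM dma-resv
 *       for_each_external_bo(vm, bo)               // 3: external BO dma-resvs
 *           dma_resv_lock(bo->ttm.base.resv, ww);
 *
 *       // ... validate BOs, install fences, submit ...
 *
 *       for_each_external_bo(vm, bo)
 *           dma_resv_unlock(bo->ttm.base.resv);
 *       dma_resv_unlock(vm->ttm.base.resv);
 *       up_read(&vm->lock);
 *   }
 *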
447 * -----------------------
449 * 1. An exec and bind operation with the same VM can't be executing at the same
450 * time (vm->lock).
452 * 2. A compute mode rebind worker and bind operation with the same VM can't be
453 * executing at the same time (vm->lock).
456 * the same VM is executing (vm->lock).
459 * compute mode rebind worker with the same VM is executing (vm->lock).
462 * executing (dma-resv locks).
465 * with the same VM is executing (dma-resv locks).
467 * dma-resv usage
471 * invalidation, munmap style unbinds which result in a rebind), rebinds during
472 * execs, execs, and resumes in the rebind worker, we use both the VM's and
473 * external BOs' dma-resv slots. Let's try to make this as clear as possible.
476 * -----------------
482 * 2. In non-compute mode, jobs from execs install themselves into the
485 * 3. In non-compute mode, jobs from execs install themselves into the
486 * DMA_RESV_USAGE_WRITE slot of all external BOs in the VM
494 * 6. Every engine using a compute mode VM has a preempt fence installed into
497 * 7. Every engine using a compute mode VM has a preempt fence installed into
498 * the DMA_RESV_USAGE_PREEMPT_FENCE slot of all the external BOs in the VM
501 * ------------
507 * 2. In non-compute mode, the execution of all jobs from rebinds in execs shall
511 * 3. In non-compute mode, the execution of all jobs from execs shall wait on the
514 * 4. In compute mode, the execution of all jobs from rebinds in the rebind
518 * 5. In compute mode, resumes in the rebind worker shall wait on the last rebind fence
520 * 6. In compute mode, resumes in the rebind worker shall wait on the
524 * -----------------------
527 * non-compute mode execs
529 * 2. New jobs from non-compute mode execs are blocked behind any existing jobs
530 * from kernel ops and rebinds
532 * 3. New jobs from kernel ops are blocked behind all preempt fences signaling in
536 * kernel ops and rebinds
541 * Support large pages for sysmem and userptr.
544 * could be in system memory while another part could be in VRAM).
547 * wait on the dma-resv kernel slots of VM or BO, technically we only have to
548 * wait on the BO moving. If using a job to do the rebind, we could not block in
552 * benchmarks / performance numbers from workloads up and running.