===============
 GPU Debugging
===============

General Debugging Options
=========================

The DebugFS section documents a number of files that aid in debugging
issues on the GPU.


GPUVM Debugging
===============

To aid in debugging GPU virtual memory related problems, the driver supports a
number of optional module parameters:

`vm_fault_stop` - If non-0, halt the GPU memory controller on a GPU page fault.

`vm_update_mode` - If non-0, use the CPU to update GPU page tables rather than
the GPU.
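
When reproducing a fault, it can help to confirm which values these parameters
currently have.  The following is a minimal sketch, assuming the parameters are
exposed as readable files under ``/sys/module/amdgpu/parameters/`` (the usual
sysfs location for module parameters).  The helper names
``read_module_param`` and ``read_gpuvm_params`` are hypothetical and not part of
the driver.

```python
from pathlib import Path

# Assumption: the amdgpu module is loaded and its parameters are readable here.
AMDGPU_PARAM_DIR = Path("/sys/module/amdgpu/parameters")

# The two GPUVM debugging parameters described above.
GPUVM_PARAMS = ("vm_fault_stop", "vm_update_mode")


def read_module_param(name, param_dir=AMDGPU_PARAM_DIR):
    """Return a module parameter as an int, or None if it cannot be read."""
    try:
        text = (Path(param_dir) / name).read_text()
    except OSError:
        # Module not loaded, parameter missing, or no read permission.
        return None
    try:
        return int(text.strip())
    except ValueError:
        return None


def read_gpuvm_params(param_dir=AMDGPU_PARAM_DIR):
    """Return a dict mapping each GPUVM debugging parameter to its value."""
    return {name: read_module_param(name, param_dir) for name in GPUVM_PARAMS}


if __name__ == "__main__":
    for name, value in read_gpuvm_params().items():
        print(f"{name} = {value}")
```

A value of ``None`` only means the file could not be read or parsed.  It does
not say anything about the driver's default.
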


Decoding a GPUVM Page Fault
===========================

If you see a GPU page fault in the kernel log, you can decode it to figure
out what is going wrong in your application.  A page fault in your kernel
log may look something like this:

::

 [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32777, for process glxinfo pid 2424 thread glxinfo:cs0 pid 2425)
   in page starting at address 0x0000800102800000 from IH client 0x1b (UTCL2)
 VM_L2_PROTECTION_FAULT_STATUS:0x00301030
 	Faulty UTCL2 client ID: TCP (0x8)
 	MORE_FAULTS: 0x0
 	WALKER_ERROR: 0x0
 	PERMISSION_FAULTS: 0x3
 	MAPPING_ERROR: 0x0
 	RW: 0x0

First comes the memory hub, either gfxhub or mmhub.  gfxhub is the memory
hub used for graphics, compute, and sdma on some chips.  mmhub is the
memory hub used for multi-media and sdma on some chips.

Next come the vmid and pasid.  If the vmid is 0, this fault was likely
caused by the kernel driver or firmware.  If the vmid is non-0, it is generally
a fault in a user application.  The pasid is used to link a vmid to a system
process id.  If the process is active when the fault happens, the process
information will be printed.

The GPU virtual address that caused the fault comes next.
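
The fields described so far can be pulled out of the log text mechanically.
The following is a minimal sketch that assumes the message layout matches the
example above.  The exact wording may differ between kernel versions, and the
function name ``parse_fault_header`` and the keys of the returned dict are
hypothetical.

```python
import re

# Matches "[gfxhub0] ... page fault (src_id:0 ring:24 vmid:3 pasid:32777, ..."
# The process information is optional because it is only printed when the
# process is still active.
HEADER_RE = re.compile(
    r"\[(?P<hub>\w+)\].*?page fault \(src_id:(?P<src_id>\d+) ring:(?P<ring>\d+) "
    r"vmid:(?P<vmid>\d+) pasid:(?P<pasid>\d+)"
    r"(?:, for process (?P<process>\S+) pid (?P<pid>\d+))?"
)

# Matches "in page starting at address 0x0000800102800000".
ADDRESS_RE = re.compile(r"in page starting at address (?P<address>0x[0-9a-fA-F]+)")


def parse_fault_header(text):
    """Extract hub, vmid, pasid, process and address from a fault message."""
    header = HEADER_RE.search(text)
    address = ADDRESS_RE.search(text)
    if header is None or address is None:
        return None
    vmid = int(header["vmid"])
    return {
        "hub": header["hub"],
        "vmid": vmid,
        "pasid": int(header["pasid"]),
        "process": header["process"],
        "pid": int(header["pid"]) if header["pid"] else None,
        "address": int(address["address"], 16),
        # vmid 0 points at the kernel driver or firmware, anything else
        # generally at a user application.
        "likely_origin": (
            "kernel driver or firmware" if vmid == 0 else "user application"
        ),
    }


if __name__ == "__main__":
    sample = (
        "[gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32777, "
        "for process glxinfo pid 2424 thread glxinfo:cs0 pid 2425)\n"
        "  in page starting at address 0x0000800102800000 from IH client 0x1b (UTCL2)\n"
    )
    print(parse_fault_header(sample))
```

For the example log, this reports hub ``gfxhub0``, vmid 3, process ``glxinfo``
and a likely origin of "user application".
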

The client ID indicates the GPU block that caused the fault.
Some common client IDs:

- CB/DB: The color/depth backend of the graphics pipe
- CPF: Command Processor Frontend
- CPC: Command Processor Compute
- CPG: Command Processor Graphics
- TCP/SQC/SQG: Shaders
- SDMA: SDMA engines
- VCN: Video encode/decode engines
- JPEG: JPEG engines

PERMISSION_FAULTS describes which faults were encountered:

- bit 0: the PTE was not valid
- bit 1: the PTE read bit was not set
- bit 2: the PTE write bit was not set
- bit 3: the PTE execute bit was not set

Finally, RW indicates whether the access was a read (0) or a write (1).

In the example above, a shader (client ID = TCP) generated a read (RW = 0x0) to
an invalid page (PERMISSION_FAULTS = 0x3) at GPU virtual address
0x0000800102800000.  The user can then inspect their shader code and resource
descriptor state to determine what caused the GPU page fault.
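
The decoded status lines can be collected and summarized in the same way.
This sketch assumes the ``NAME: 0xVALUE`` layout from the example log.  The
names ``parse_fault_fields`` and ``summarize_fault`` are hypothetical, and the
summary wording is only illustrative.

```python
import re

# Matches lines such as "PERMISSION_FAULTS: 0x3" or
# "VM_L2_PROTECTION_FAULT_STATUS:0x00301030".
FIELD_RE = re.compile(r"(?P<name>[A-Z][A-Z0-9_]+):\s*(?P<value>0x[0-9a-fA-F]+)\s*$")

# Matches "Faulty UTCL2 client ID: TCP (0x8)".
CLIENT_RE = re.compile(r"client ID:\s*(?P<name>\w+)\s*\((?P<id>0x[0-9a-fA-F]+)\)")


def parse_fault_fields(text):
    """Collect the client name and the numeric status fields from a fault dump."""
    fields = {}
    for line in text.splitlines():
        client = CLIENT_RE.search(line)
        if client:
            fields["client"] = client["name"]
            continue
        match = FIELD_RE.search(line)
        if match:
            fields[match["name"]] = int(match["value"], 16)
    return fields


def summarize_fault(fields):
    """Build a one-line summary: which block, read or write, which faults."""
    if "RW" not in fields:
        access = "unknown access"
    else:
        access = "write" if fields["RW"] else "read"
    client = fields.get("client", "unknown client")
    faults = fields.get("PERMISSION_FAULTS", 0)
    return f"{client} {access}, PERMISSION_FAULTS={faults:#x}"


if __name__ == "__main__":
    sample = (
        " VM_L2_PROTECTION_FAULT_STATUS:0x00301030\n"
        " \tFaulty UTCL2 client ID: TCP (0x8)\n"
        " \tMORE_FAULTS: 0x0\n"
        " \tWALKER_ERROR: 0x0\n"
        " \tPERMISSION_FAULTS: 0x3\n"
        " \tMAPPING_ERROR: 0x0\n"
        " \tRW: 0x0\n"
    )
    print(summarize_fault(parse_fault_fields(sample)))
```

For the example log, this prints ``TCP read, PERMISSION_FAULTS=0x3``, which
matches the interpretation given above.
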

UMR
===

`umr <https://gitlab.freedesktop.org/tomstdenis/umr>`_ is a general-purpose
GPU debugging and diagnostics tool.  Please see the umr
`documentation <https://umr.readthedocs.io/en/main/>`_ for more information
about its capabilities.
