==================================================
page owner: Tracking about who allocated each page
==================================================

Introduction
============

page owner tracks who allocated each page.
It can be used to debug memory leaks or to find a memory hog.
When an allocation happens, information about it, such as the call stack
and the order of the pages, is stored in dedicated storage for each page.
When we need to know the status of all pages, we can retrieve and analyze
this information.

Although tracepoints for tracing page allocation and freeing already exist,
using them to analyze who allocated each page is rather complex. The trace
buffer has to be enlarged so that it is not overwritten before the
userspace program is launched. Once launched, that program has to
continually dump the trace buffer for later analysis, which is more likely
to change system behaviour than simply keeping the information in memory,
and is therefore bad for debugging.

page owner can also be used for other purposes. For example, accurate
fragmentation statistics can be obtained from the gfp flag information of
each page. This is already implemented and is activated when page owner is
enabled. Other uses are more than welcome.

It can also be used to show all the stacks and their current number of
allocated base pages, which gives a quick overview of where the memory
is going without the need to go through all the pages and match each
allocation with its free operation.

page owner is disabled by default. To use it, add "page_owner=on" to your
boot cmdline. If the kernel is built with page owner but page owner is
disabled at runtime because the boot option was not given, the runtime
overhead is marginal. When disabled at runtime, it does not need memory to
store owner information, so there is no runtime memory overhead. page owner
inserts just two unlikely branches into the page allocator hotpath; if it
is not enabled, allocation proceeds as in a kernel built without page
owner. These two unlikely branches should not affect allocation
performance, especially if the static keys jump label patching
functionality is available.

As for code size, enabling page owner increases the kernel size by several
kilobytes, but most of this code is outside the page allocator and its hot
path. Building the kernel with page owner and turning it on when needed is
a good option for debugging kernel memory problems.

There is one caveat caused by an implementation detail. page owner
stores information in the memory from the struct page extension.
This memory is initialized some time after the page allocator starts on
sparse memory systems, so many pages can be allocated before
initialization and would have no owner information. To fix this up, these
early allocated pages are examined and marked as allocated during the
initialization phase. Although this does not give them the right owner
information, we can at least tell more accurately whether a page is
allocated or not. On a 2GB memory x86-64 VM box, 13343 early allocated
pages are caught and marked, although they are mostly allocated by the
struct page extension feature. After that, no page is left in an
untracked state.

Usage
=====

1) Build user-space helper::

        cd tools/mm
        make page_owner_sort

2) Enable page owner: add "page_owner=on" to boot cmdline.

3) Do the job that you want to debug.

4) Analyze information from page owner::

        cat /sys/kernel/debug/page_owner_stacks/show_stacks > stacks.txt
        cat stacks.txt
         post_alloc_hook+0x177/0x1a0
         get_page_from_freelist+0xd01/0xd80
         __alloc_pages+0x39e/0x7e0
         allocate_slab+0xbc/0x3f0
         ___slab_alloc+0x528/0x8a0
         kmem_cache_alloc+0x224/0x3b0
         sk_prot_alloc+0x58/0x1a0
         sk_alloc+0x32/0x4f0
         inet_create+0x427/0xb50
         __sock_create+0x2e4/0x650
         inet_ctl_sock_create+0x30/0x180
         igmp_net_init+0xc1/0x130
         ops_init+0x167/0x410
         setup_net+0x304/0xa60
         copy_net_ns+0x29b/0x4a0
         create_new_namespaces+0x4a1/0x820
         nr_base_pages: 16
        ...
        ...
        echo 7000 > /sys/kernel/debug/page_owner_stacks/count_threshold
        cat /sys/kernel/debug/page_owner_stacks/show_stacks > stacks_7000.txt
        cat stacks_7000.txt
         post_alloc_hook+0x177/0x1a0
         get_page_from_freelist+0xd01/0xd80
         __alloc_pages+0x39e/0x7e0
         alloc_pages_mpol+0x22e/0x490
         folio_alloc+0xd5/0x110
         filemap_alloc_folio+0x78/0x230
         page_cache_ra_order+0x287/0x6f0
         filemap_get_pages+0x517/0x1160
         filemap_read+0x304/0x9f0
         xfs_file_buffered_read+0xe6/0x1d0 [xfs]
         xfs_file_read_iter+0x1f0/0x380 [xfs]
         __kernel_read+0x3b9/0x730
         kernel_read_file+0x309/0x4d0
         __do_sys_finit_module+0x381/0x730
         do_syscall_64+0x8d/0x150
         entry_SYSCALL_64_after_hwframe+0x62/0x6a
         nr_base_pages: 20824
        ...

        cat /sys/kernel/debug/page_owner > page_owner_full.txt
        ./page_owner_sort page_owner_full.txt sorted_page_owner.txt

   The general output of ``page_owner_full.txt`` is as follows::

        Page allocated via order XXX, ...
        PFN XXX ...
        // Detailed stack

        Page allocated via order XXX, ...
        PFN XXX ...
        // Detailed stack

   By default, the whole pfn range is dumped. To start from a given pfn,
   page_owner supports fseek::

        FILE *fp = fopen("/sys/kernel/debug/page_owner", "r");
        fseek(fp, pfn_start, SEEK_SET);

   The ``page_owner_sort`` tool ignores ``PFN`` rows, puts the remaining rows
   in buf, uses regexp to extract the page order value, counts the times
   and pages of buf, and finally sorts them according to the parameter(s).

   See the result of who allocated each page
   in ``sorted_page_owner.txt``. General output::

        XXX times, XXX pages:
        Page allocated via order XXX, ...
        // Detailed stack

   By default, ``page_owner_sort`` sorts by the times of buf.
   To sort by the number of pages of buf, use the ``-m`` parameter.
   The detailed parameters are:

   fundamental function::

        Sort:
                -a              Sort by memory allocation time.
                -m              Sort by total memory.
                -p              Sort by pid.
                -P              Sort by tgid.
                -n              Sort by task command name.
                -r              Sort by memory release time.
                -s              Sort by stack trace.
                -t              Sort by times (default).
                --sort <order>  Specify sorting order.  Sorting syntax is [+|-]key[,[+|-]key[,...]].
                                Choose a key from the **STANDARD FORMAT SPECIFIERS** section. The "+" is
                                optional since default direction is increasing numerical or lexicographic
                                order. Mixed use of abbreviated and complete-form of keys is allowed.

                Examples:
                                ./page_owner_sort <input> <output> --sort=n,+pid,-tgid
                                ./page_owner_sort <input> <output> --sort=at

   additional function::

        Cull:
                --cull <rules>
                                Specify culling rules. Culling syntax is key[,key[,...]]. Choose a
                                multi-letter key from the **STANDARD FORMAT SPECIFIERS** section.

                <rules> is a single argument in the form of a comma-separated list of
                keys k1,k2,..., which offers a way to specify individual culling rules.
                The recognized keys are described in the **STANDARD FORMAT SPECIFIERS**
                section below. Mixed use of abbreviated and complete-form of keys is
                allowed.

                Examples:
                                ./page_owner_sort <input> <output> --cull=stacktrace
                                ./page_owner_sort <input> <output> --cull=st,pid,name
                                ./page_owner_sort <input> <output> --cull=n,f

        Filter:
                -f              Filter out the information of blocks whose memory has been released.

        Select:
                --pid <pidlist>         Select by pid.
                                        This selects the blocks whose process ID
                                        numbers appear in <pidlist>.
                --tgid <tgidlist>       Select by tgid. This selects the blocks whose thread
                                        group ID numbers appear in <tgidlist>.
                --name <cmdlist>        Select by task command name. This selects the blocks whose
                                        task command name appears in <cmdlist>.

                <pidlist>, <tgidlist>, <cmdlist> are single arguments in the form of a comma-separated list,
                which offers a way to specify individual selecting rules.


                Examples:
                                ./page_owner_sort <input> <output> --pid=1
                                ./page_owner_sort <input> <output> --tgid=1,2,3
                                ./page_owner_sort <input> <output> --name name1,name2

STANDARD FORMAT SPECIFIERS
==========================
::

        For --sort option:

        KEY             LONG            DESCRIPTION
        p               pid             process ID
        tg              tgid            thread group ID
        n               name            task command name
        st              stacktrace      stack trace of the page allocation
        T               txt             full text of block
        ft              free_ts         timestamp of the page when it was released
        at              alloc_ts        timestamp of the page when it was allocated
        ator            allocator       memory allocator for pages

        For --cull option:

        KEY             LONG            DESCRIPTION
        p               pid             process ID
        tg              tgid            thread group ID
        n               name            task command name
        f               free            whether the page has
                                        been released or not
        st              stacktrace      stack trace of the page allocation
        ator            allocator       memory allocator for pages
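
As a rough illustration of how the ``show_stacks`` output from the Usage
section could be post-processed, the following Python sketch ranks stacks by
``nr_base_pages``. It is not part of the kernel tree. The function names
``parse_stacks`` and ``top_stacks`` are hypothetical, and the parser assumes
the record layout shown in the sample output above (stack frames followed by
an ``nr_base_pages:`` line), which may differ between kernel versions.

```python
import re
from typing import List, Tuple

# Assumption: each record ends with a line such as "nr_base_pages: 16",
# as in the sample output shown in the Usage section.
NR_PAGES_RE = re.compile(r"^nr_base_pages:\s*(\d+)$")


def parse_stacks(text: str) -> List[Tuple[int, List[str]]]:
    """Split show_stacks-style text into (nr_base_pages, frames) records."""
    records: List[Tuple[int, List[str]]] = []
    frames: List[str] = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank separators and the "..." elision used in the docs.
        if not stripped or stripped == "...":
            continue
        match = NR_PAGES_RE.match(stripped)
        if match:
            # The counter line closes the current record.
            records.append((int(match.group(1)), frames))
            frames = []
        else:
            frames.append(stripped)
    return records


def top_stacks(text: str, limit: int = 5) -> List[Tuple[int, List[str]]]:
    """Return the `limit` records holding the most base pages."""
    ranked = sorted(parse_stacks(text), key=lambda rec: rec[0], reverse=True)
    return ranked[:limit]


# Shortened sample in the documented format (frames trimmed for brevity).
SAMPLE = """\
 post_alloc_hook+0x177/0x1a0
 get_page_from_freelist+0xd01/0xd80
 nr_base_pages: 16

 post_alloc_hook+0x177/0x1a0
 filemap_alloc_folio+0x78/0x230
 nr_base_pages: 20824
"""

if __name__ == "__main__":
    for pages, frames in top_stacks(SAMPLE):
        outermost = frames[-1] if frames else "?"
        print(f"{pages:>8} base pages, outermost frame: {outermost}")
```

To use it on real data, the contents of ``stacks.txt`` could be read and
passed to ``top_stacks`` in place of ``SAMPLE``. Setting ``count_threshold``
in debugfs, as shown in the Usage section, remains the simpler way to filter
small stacks on the kernel side.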