.. _idle_page_tracking:

==================
Idle Page Tracking
==================

Motivation
==========

The idle page tracking feature makes it possible to track which memory pages
are being accessed by a workload and which are idle. This information can be
useful for estimating the workload's working set size, which, in turn, can be
taken into account when configuring the workload parameters, setting memory
cgroup limits, or deciding where to place the workload within a compute cluster.

It is enabled by CONFIG_IDLE_PAGE_TRACKING=y.

.. _user_api:

User API
========

The idle page tracking API is located at ``/sys/kernel/mm/page_idle``.
Currently, it consists of a single read-write file,
``/sys/kernel/mm/page_idle/bitmap``.

The file implements a bitmap where each bit corresponds to a memory page. The
bitmap is represented by an array of 8-byte integers, and the page at PFN #i is
mapped to bit #i%64 of array element #i/64; byte order is native. When a bit is
set, the corresponding page is idle.
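
The PFN-to-bit mapping described above can be illustrated with a minimal
Python sketch. The helper names (``pfn_to_word_and_bit``, ``pfn_file_offset``,
``is_idle``) are hypothetical and exist only for this illustration; the sketch
assumes nothing beyond the layout stated above (8-byte array elements, native
byte order).

```python
import struct

BITS_PER_WORD = 64  # each array element is an 8-byte integer
WORD_SIZE = 8       # size of one array element in bytes


def pfn_to_word_and_bit(pfn):
    """Return (array element index, bit index) for the page at this PFN."""
    return pfn // BITS_PER_WORD, pfn % BITS_PER_WORD


def pfn_file_offset(pfn):
    """Byte offset of the 8-byte element that holds the bit for this PFN."""
    return (pfn // BITS_PER_WORD) * WORD_SIZE


def is_idle(word_bytes, pfn):
    """Test the PFN's bit in one 8-byte element read from the bitmap."""
    (word,) = struct.unpack("=Q", word_bytes)  # "=" selects native byte order
    return bool((word >> (pfn % BITS_PER_WORD)) & 1)
```

For example, the page at PFN 130 maps to bit 2 of array element 2, which
starts at byte offset 16 in the file.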

A page is considered idle if it has not been accessed since it was marked idle
(for more details on what "accessed" actually means, see the
:ref:`Implementation Details <impl_details>` section).
To mark a page idle one has to set the bit corresponding to
the page by writing to the file. A value written to the file is OR-ed with the
current bitmap value.

Only accesses to user memory pages are tracked. These are pages mapped to a
process address space, page cache and buffer pages, and swap cache pages. For
other page types (e.g. SLAB pages) an attempt to mark a page idle is silently
ignored, and hence such pages are never reported idle.

For huge pages the idle flag is set only on the head page, so one has to read
``/proc/kpageflags`` in order to correctly count idle huge pages.

Reading from or writing to ``/sys/kernel/mm/page_idle/bitmap`` will return
-EINVAL if you are not starting the read/write on an 8-byte boundary, or
if the size of the read/write is not a multiple of 8 bytes. Writing to
this file beyond max PFN will return -ENXIO.

That said, in order to estimate the number of pages that are not used by a
workload one should:

 1. Mark all the workload's pages as idle by setting corresponding bits in
    ``/sys/kernel/mm/page_idle/bitmap``.
    The pages can be found by reading
    ``/proc/pid/pagemap`` if the workload is represented by a process, or by
    filtering out alien pages using ``/proc/kpagecgroup`` in case the workload
    is placed in a memory cgroup.

 2. Wait until the workload accesses its working set.

 3. Read ``/sys/kernel/mm/page_idle/bitmap`` and count the number of bits set.
    If one wants to ignore certain types of pages, e.g. mlocked pages since they
    are not reclaimable, one can filter them out using
    ``/proc/kpageflags``.

The page-types tool in the tools/vm directory can be used to assist in this.
If the tool is run initially with the appropriate option, it will mark all the
queried pages as idle. Subsequent runs of the tool can then show which pages have
their idle flag cleared in the interim.

See :ref:`Documentation/admin-guide/mm/pagemap.rst <pagemap>` for more
information about ``/proc/pid/pagemap``, ``/proc/kpageflags``, and
``/proc/kpagecgroup``.

.. _impl_details:

Implementation Details
======================

The kernel internally keeps track of accesses to user memory pages in order to
reclaim unreferenced pages first on memory shortage conditions.
A page is considered referenced if it has been recently accessed via a process
address space, in which case one or more PTEs it is mapped to will have the
Accessed bit set, or marked accessed explicitly by the kernel (see
mark_page_accessed()). The latter happens when:

 - a userspace process reads or writes a page using a system call (e.g. read(2)
   or write(2))

 - a page that is used for storing filesystem buffers is read or written,
   because a process needs filesystem metadata stored in it (e.g. lists a
   directory tree)

 - a page is accessed by a device driver using get_user_pages()

When a dirty page is written to swap or disk as a result of memory reclaim or
exceeding the dirty memory limit, it is not marked referenced.

The idle memory tracking feature adds a new page flag, the Idle flag. This flag
is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the
:ref:`User API <user_api>` section), and cleared automatically whenever a page
is referenced as defined above.

When a page is marked idle, the Accessed bit must be cleared in all PTEs it is
mapped to, otherwise we will not be able to detect accesses to the page coming
from a process address space.
To avoid interference with the reclaimer, which,
as noted above, uses the Accessed bit to promote actively referenced pages, one
more page flag is introduced, the Young flag. When the PTE Accessed bit is
cleared as a result of setting or updating a page's Idle flag, the Young flag
is set on the page. The reclaimer treats the Young flag as an extra PTE
Accessed bit and therefore will consider such a page as referenced.

Since the idle memory tracking feature is based on the memory reclaimer logic,
it only works with pages that are on an LRU list; other pages are silently
ignored. That means it will ignore a user memory page if it is isolated, but
since there are usually not many of them, it should not affect the overall
result noticeably. In order not to stall scanning of the idle page bitmap,
locked pages may be skipped too.
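
The interplay between the PTE Accessed bits, the Idle flag, and the Young flag
described in this section can be illustrated with a toy model in Python. This
is a sketch for illustration only and is not kernel code: the ``Page`` class
and all of its method names are hypothetical, and the model assumes that
reading a page's bit from the bitmap also counts as "updating" its Idle flag.

```python
class Page:
    """Toy stand-in for a user memory page mapped by one or more PTEs."""

    def __init__(self, n_ptes=1):
        self.pte_accessed = [False] * n_ptes  # one Accessed bit per mapping
        self.idle = False
        self.young = False

    def touch(self, pte=0):
        """A process accesses the page through one of its mappings."""
        self.pte_accessed[pte] = True

    def mark_accessed(self):
        """Explicit reference by the kernel (the mark_page_accessed() case)."""
        self.idle = False

    def _clear_pte_refs(self):
        """Clear all Accessed bits; if any was set, the page was referenced."""
        if any(self.pte_accessed):
            self.pte_accessed = [False] * len(self.pte_accessed)
            self.idle = False
            self.young = True  # keep the reference visible to the reclaimer

    def mark_idle(self):
        """Model of setting the page's bit by writing to the bitmap."""
        self._clear_pte_refs()
        self.idle = True

    def is_idle(self):
        """Model of reading the page's bit from the bitmap."""
        self._clear_pte_refs()
        return self.idle

    def reclaimer_sees_referenced(self):
        """The reclaimer treats Young as one more Accessed bit."""
        referenced = any(self.pte_accessed) or self.young
        self.pte_accessed = [False] * len(self.pte_accessed)
        self.young = False
        return referenced
```

In this model, marking a recently touched page idle clears its Accessed bits
but sets Young, so a later call to ``reclaimer_sees_referenced()`` still
reports the page as referenced, which is the interference the Young flag is
meant to prevent.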