.. _idle_page_tracking:

==================
Idle Page Tracking
==================

Motivation
==========

The idle page tracking feature makes it possible to track which memory pages
are being accessed by a workload and which are idle. This information can be
useful for estimating the workload's working set size, which, in turn, can be
taken into account when configuring the workload parameters, setting memory
cgroup limits, or deciding where to place the workload within a compute cluster.

The feature is enabled by building the kernel with
``CONFIG_IDLE_PAGE_TRACKING=y``.

.. _user_api:

User API
========

The idle page tracking API is located at ``/sys/kernel/mm/page_idle``.
Currently, it consists of a single read-write file,
``/sys/kernel/mm/page_idle/bitmap``.

The file implements a bitmap where each bit corresponds to a memory page. The
bitmap is represented by an array of 8-byte integers, and the page at PFN #i is
mapped to bit #i%64 of array element #i/64; the byte order is native. When a bit
is set, the corresponding page is idle.
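The mapping from a PFN to a position in the bitmap can be sketched as follows.
This is a minimal illustration of the arithmetic described above, not kernel
code; the helper names ``bitmap_location``, ``make_word`` and ``is_idle`` are
hypothetical.

```python
import struct

BITS_PER_WORD = 64  # each array element is an 8-byte integer
WORD_SIZE = 8       # size of one array element in bytes


def bitmap_location(pfn):
    """Return (byte_offset, bit_index) of the bit that represents `pfn`."""
    return (pfn // BITS_PER_WORD) * WORD_SIZE, pfn % BITS_PER_WORD


def make_word(bit_indexes):
    """Build one 8-byte array element, in native byte order, with the given bits set."""
    word = 0
    for bit in bit_indexes:
        word |= 1 << bit
    return struct.pack("=Q", word)


def is_idle(word_bytes, bit_index):
    """Test one bit of an 8-byte array element read from the bitmap."""
    (word,) = struct.unpack("=Q", word_bytes)
    return bool((word >> bit_index) & 1)
```

For example, PFN 130 lives in array element 2 (byte offset 16), bit 2.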

A page is considered idle if it has not been accessed since it was marked idle
(for more details on what "accessed" actually means see the :ref:`Implementation
Details <impl_details>` section).
To mark a page idle, set the bit corresponding to the page by writing to the
file. A value written to the file is OR-ed with the current bitmap value.

Only accesses to user memory pages are tracked. These are pages mapped to a
process address space, page cache and buffer pages, and swap cache pages. For
other page types (e.g. SLAB pages), an attempt to mark a page idle is silently
ignored, and hence such pages are never reported idle.

For huge pages the idle flag is set only on the head page, so one has to read
``/proc/kpageflags`` in order to correctly count idle huge pages.
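One possible way to account for huge pages is sketched below. It assumes that
the per-PFN idle bits and the per-PFN ``/proc/kpageflags`` values have already
been read for a contiguous PFN range that does not start in the middle of a
compound page. The bit numbers for the compound head and tail flags are taken
from the pagemap documentation and should be checked against the running
kernel; ``count_idle_pages`` is a hypothetical helper.

```python
# Flag bit numbers as listed in the pagemap documentation (assumption: verify
# them against the kernel version in use).
KPF_COMPOUND_HEAD = 15
KPF_COMPOUND_TAIL = 16


def count_idle_pages(idle_bits, kpageflags):
    """Count idle base pages, treating tail pages as idle when their head is idle.

    idle_bits  -- one boolean per PFN, taken from the idle bitmap
    kpageflags -- one 64-bit flags value per PFN, for the same PFN range
    """
    count = 0
    head_idle = False
    for idle, flags in zip(idle_bits, kpageflags):
        if (flags >> KPF_COMPOUND_TAIL) & 1:
            # The idle flag is kept only on the head page.
            idle = head_idle
        elif (flags >> KPF_COMPOUND_HEAD) & 1:
            head_idle = idle
        else:
            head_idle = False
        count += int(idle)
    return count
```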

A read from or write to ``/sys/kernel/mm/page_idle/bitmap`` fails with
-EINVAL if it does not start on an 8-byte boundary, or if its size is not a
multiple of 8 bytes. Writing to this file beyond max PFN fails with -ENXIO.
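To satisfy the alignment rules, a PFN range can be widened to whole array
elements before it is read or written. The following sketch shows the
arithmetic only; ``aligned_span`` is a hypothetical helper, not part of any
existing tool.

```python
def aligned_span(first_pfn, last_pfn):
    """Return (offset, length) in bytes of the smallest 8-byte-aligned region
    of the bitmap that covers PFNs first_pfn..last_pfn inclusive."""
    if first_pfn < 0 or last_pfn < first_pfn:
        raise ValueError("invalid PFN range")
    first_word = first_pfn // 64
    last_word = last_pfn // 64
    return first_word * 8, (last_word - first_word + 1) * 8
```

Both returned values are multiples of 8, so a read or write issued with them
meets the boundary and size requirements. Bits that belong to neighbouring
pages can be left as zero on a write, since the written value is OR-ed with
the current bitmap.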

That said, in order to estimate the number of pages that are not used by a
workload, one should:

 1. Mark all the workload's pages as idle by setting corresponding bits in
    ``/sys/kernel/mm/page_idle/bitmap``. The pages can be found by reading
    ``/proc/pid/pagemap`` if the workload is represented by a process, or by
    filtering out alien pages using ``/proc/kpagecgroup`` in case the workload
    is placed in a memory cgroup.

 2. Wait until the workload accesses its working set.

 3. Read ``/sys/kernel/mm/page_idle/bitmap`` and count the number of bits set.
    If certain types of pages should be ignored, e.g. mlocked pages, since they
    are not reclaimable, they can be filtered out using
    ``/proc/kpageflags``.
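The steps above can be sketched for a single process as follows. This is a
minimal illustration under stated assumptions, not a complete tool: the
pagemap entry layout (bit 63 for "present", bits 0-54 for the PFN) is taken
from the pagemap documentation, all function names are hypothetical, and the
``bitmap`` argument is any seekable binary file object, so the logic can be
tried on an in-memory buffer before it is pointed at the real file.

```python
import struct

PM_PRESENT = 1 << 63          # pagemap: page is present in RAM
PM_PFN_MASK = (1 << 55) - 1   # pagemap: bits 0-54 hold the PFN


def pfns_from_pagemap(entries):
    """Extract the PFNs of present pages from raw 8-byte pagemap entries."""
    pfns = []
    for (entry,) in struct.iter_unpack("=Q", entries):
        pfn = entry & PM_PFN_MASK
        if entry & PM_PRESENT and pfn:
            pfns.append(pfn)
    return pfns


def words_for(pfns):
    """Group PFNs into {array element index: 64-bit mask}."""
    words = {}
    for pfn in pfns:
        words[pfn // 64] = words.get(pfn // 64, 0) | (1 << (pfn % 64))
    return words


def mark_idle(bitmap, pfns):
    """Step 1: set the bits of the given PFNs, one aligned 8-byte write each."""
    for index, mask in sorted(words_for(pfns).items()):
        bitmap.seek(index * 8)
        bitmap.write(struct.pack("=Q", mask))


def count_idle(bitmap, pfns):
    """Step 3: count how many of the given PFNs still have their bit set."""
    idle = 0
    for index, mask in sorted(words_for(pfns).items()):
        bitmap.seek(index * 8)
        (word,) = struct.unpack("=Q", bitmap.read(8))
        idle += bin(word & mask).count("1")
    return idle
```

On a real system the bitmap would presumably be opened unbuffered, e.g.
``open("/sys/kernel/mm/page_idle/bitmap", "r+b", buffering=0)``, with
sufficient privileges; a PFN of zero in a pagemap entry is skipped because it
usually means the PFN was hidden from the reader.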

The page-types tool in the tools/vm directory can be used to assist in this.
If the tool is run initially with the appropriate option, it will mark all the
queried pages as idle. Subsequent runs of the tool can then show which pages
have had their idle flag cleared in the interim.

See :ref:`Documentation/admin-guide/mm/pagemap.rst <pagemap>` for more
information about ``/proc/pid/pagemap``, ``/proc/kpageflags``, and
``/proc/kpagecgroup``.

.. _impl_details:

Implementation Details
======================

The kernel internally keeps track of accesses to user memory pages in order to
reclaim unreferenced pages first when memory is short. A page is considered
referenced if it has been recently accessed via a process address space, in
which case one or more PTEs it is mapped to will have the Accessed bit set, or
if it has been marked accessed explicitly by the kernel (see
mark_page_accessed()). The latter happens when:

 - a userspace process reads or writes a page using a system call (e.g. read(2)
   or write(2))

 - a page that is used for storing filesystem buffers is read or written,
   because a process needs filesystem metadata stored in it (e.g. lists a
   directory tree)

 - a page is accessed by a device driver using get_user_pages()

When a dirty page is written to swap or disk as a result of memory reclaim or
exceeding the dirty memory limit, it is not marked referenced.

The idle memory tracking feature adds a new page flag, the Idle flag. This flag
is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the
:ref:`User API <user_api>`
section), and cleared automatically whenever a page is referenced as defined
above.

When a page is marked idle, the Accessed bit must be cleared in all PTEs it is
mapped to; otherwise we will not be able to detect accesses to the page coming
from a process address space. To avoid interference with the reclaimer, which,
as noted above, uses the Accessed bit to promote actively referenced pages, one
more page flag is introduced, the Young flag. When the PTE Accessed bit is
cleared as a result of setting or updating a page's Idle flag, the Young flag
is set on the page. The reclaimer treats the Young flag as an extra PTE
Accessed bit and therefore will consider such a page as referenced.
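The interplay of the three bits can be illustrated with a toy model. This is
a deliberately simplified sketch, assuming a page mapped by a single PTE; the
class and method names are hypothetical and do not correspond to kernel
functions.

```python
class Page:
    """Toy model of one user page mapped by a single PTE."""

    def __init__(self):
        self.pte_accessed = False  # Accessed bit in the PTE
        self.idle = False          # Idle page flag
        self.young = False         # Young page flag

    def touch(self):
        """A userspace access through the mapping sets the PTE Accessed bit."""
        self.pte_accessed = True

    def _clear_pte_refs(self):
        """Clear the PTE Accessed bit; if it was set, the page was referenced,
        so drop Idle and remember the reference for the reclaimer in Young."""
        if self.pte_accessed:
            self.pte_accessed = False
            self.idle = False
            self.young = True

    def mark_idle(self):
        """Model of setting the page's bit by writing to the bitmap."""
        self._clear_pte_refs()
        self.idle = True

    def is_idle(self):
        """Model of reading the page's bit from the bitmap."""
        self._clear_pte_refs()
        return self.idle

    def reclaimer_referenced(self):
        """Model of the reclaimer's check: Young counts as an extra Accessed bit."""
        via_pte = self.pte_accessed
        self.pte_accessed = False
        if via_pte:
            self.idle = False
        via_young = self.young
        self.young = False
        return via_pte or via_young
```

In this model, a page that was touched and then marked idle is still seen as
referenced by the reclaimer once, through the Young flag, while remaining
idle from the point of view of the bitmap.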

Since the idle memory tracking feature is based on the memory reclaimer logic,
it only works with pages that are on an LRU list; other pages are silently
ignored. That means it will ignore a user memory page if it is isolated, but
since there are usually not many of them, it should not affect the overall
result noticeably. In order not to stall scanning of the idle page bitmap,
locked pages may be skipped too.