.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

==============================================
``intel_idle`` CPU Idle Time Management Driver
==============================================

:Copyright: |copy| 2020 Intel Corporation

:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


General Information
===================

``intel_idle`` is a part of the
:doc:`CPU idle time management subsystem <cpuidle>` in the Linux kernel
(``CPUIdle``).  It is the default CPU idle time management driver for the
Nehalem and later generations of Intel processors, but the level of support for
a particular processor model in it depends on whether or not it recognizes that
processor model and may also depend on information coming from the platform
firmware.  [To understand ``intel_idle`` it is necessary to know how ``CPUIdle``
works in general, so this is the time to get familiar with
Documentation/admin-guide/pm/cpuidle.rst if you have not done that yet.]

``intel_idle`` uses the ``MWAIT`` instruction to inform the processor that the
logical CPU executing it is idle and so it may be possible to put some of the
processor's functional blocks into low-power states.  That instruction takes two
arguments (passed in the ``EAX`` and ``ECX`` registers of the target CPU), the
first of which, referred to as a *hint*, can be used by the processor to
determine what can be done (for details refer to the Intel Software Developer’s
Manual [1]_).  Accordingly, ``intel_idle`` refuses to work with processors in
which the support for the ``MWAIT`` instruction has been disabled (for example,
via the platform firmware configuration menu) or which do not support that
instruction at all.

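To illustrate how a hint value is structured, the sketch below splits a hint
into its two fields.  The field layout (bits 7:4 for the target C-state field
and bits 3:0 for the sub-state field) is an assumption taken from the ``MWAIT``
description in the SDM [1]_, and ``decode_mwait_hint()`` is a hypothetical
helper, not something provided by ``intel_idle``.

.. code-block:: python

   def decode_mwait_hint(hint):
       """Split an MWAIT hint (the EAX argument) into its two fields.

       Assumption: bits 7:4 hold the target C-state field and bits 3:0
       hold the sub-state field.
       """
       cstate_field = (hint >> 4) & 0xF
       substate_field = hint & 0xF
       return cstate_field, substate_field


   # 0x21 is an arbitrary example value, not tied to any processor model.
   print(decode_mwait_hint(0x21))  # (2, 1)

Which hint values are meaningful, and what the processor does with them, still
depends on the processor model as described below.
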
``intel_idle`` is not modular, so it cannot be unloaded, which means that the
only way to pass early-configuration-time parameters to it is via the kernel
command line.

Sysfs Interface
===============

The ``intel_idle`` driver exposes the following ``sysfs`` attributes in
``/sys/devices/system/cpu/cpuidle/``:

``intel_c1_demotion``
	Enable or disable C1 demotion for all CPUs in the system.  This file is
	only exposed on platforms that support the C1 demotion feature and on
	which it has been tested.  The value 0 means that C1 demotion is
	disabled and the value 1 means that it is enabled.  Write 0 or 1 to
	disable or enable C1 demotion for all CPUs.

	The C1 demotion feature involves the platform firmware demoting deep
	C-state requests from the OS (e.g., C6 requests) to C1.  The idea is
	that the firmware monitors the CPU wake-up rate and, if it is higher
	than a platform-specific threshold, demotes deep C-state requests to
	C1.  For example, Linux requests C6, but the firmware notices too many
	wake-ups per second and keeps the CPU in C1.  When the CPU stays in C1
	long enough, the platform promotes it back to C6.  This may improve
	the performance of some workloads, but it may also increase power
	consumption.

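A minimal sketch of reading and writing that attribute from user space is shown
below.  The helper names are hypothetical, and the ``path`` argument exists
only so that the functions can be exercised against an ordinary file; writing
to the real attribute requires root privileges.

.. code-block:: python

   from pathlib import Path

   C1_DEMOTION = Path("/sys/devices/system/cpu/cpuidle/intel_c1_demotion")


   def get_c1_demotion(path=C1_DEMOTION):
       """Return True if enabled, False if disabled, None if not exposed."""
       try:
           return Path(path).read_text().strip() == "1"
       except FileNotFoundError:
           # The attribute is absent on platforms without tested support.
           return None


   def set_c1_demotion(enable, path=C1_DEMOTION):
       """Write 1 (enable) or 0 (disable); applies to all CPUs."""
       Path(path).write_text("1" if enable else "0")

A ``None`` result from ``get_c1_demotion()`` only says that the file is not
there; it does not distinguish between missing hardware support and a platform
on which the feature has not been tested.
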
.. _intel-idle-enumeration-of-states:

Enumeration of Idle States
==========================

Each ``MWAIT`` hint value is interpreted by the processor as a license to
reconfigure itself in a certain way in order to save energy.  The processor
configurations (with reduced power draw) resulting from that are referred to
as C-states (in the ACPI terminology) or idle states.  The list of meaningful
``MWAIT`` hint values and idle states (i.e. low-power configurations of the
processor) corresponding to them depends on the processor model and it may also
depend on the configuration of the platform.

In order to create a list of available idle states required by the ``CPUIdle``
subsystem (see :ref:`idle-states-representation` in
Documentation/admin-guide/pm/cpuidle.rst),
``intel_idle`` can use two sources of information: static tables of idle states
for different processor models included in the driver itself and the ACPI tables
of the system.  The former are always used if the processor model at hand is
recognized by ``intel_idle`` and the latter are used if that is required for
the given processor model (which is the case for all server processor models
recognized by ``intel_idle``) or if the processor model is not recognized.
[There is a module parameter that can be used to make the driver use the ACPI
tables with any processor model recognized by it; see
`below <intel-idle-parameters_>`_.]

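The choice between the two sources can be summarized with the following
sketch.  It is a simplified model of the rules stated above, not the driver's
code: the function and argument names are hypothetical, and the module
parameters that make the driver ignore one of the sources are left out.

.. code-block:: python

   def idle_state_sources(model_recognized, acpi_required=False,
                          use_acpi=False):
       """Return the information sources consulted for idle state enumeration.

       model_recognized: the processor model has a static table in the driver
       acpi_required:    the model requires the ACPI tables to be consulted
       use_acpi:         the user asked for the ACPI tables to be used
       """
       if not model_recognized:
           return ["acpi"]
       sources = ["static_table"]
       if acpi_required or use_acpi:
           sources.append("acpi")
       return sources


   print(idle_state_sources(True, acpi_required=True))
   # ['static_table', 'acpi']
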
If the ACPI tables are going to be used for building the list of available idle
states, ``intel_idle`` first looks for a ``_CST`` object under one of the ACPI
objects corresponding to the CPUs in the system (refer to the ACPI specification
[2]_ for the description of ``_CST`` and its output package).  Because the
``CPUIdle`` subsystem expects that the list of idle states supplied by the
driver will be suitable for all of the CPUs handled by it and ``intel_idle`` is
registered as the ``CPUIdle`` driver for all of the CPUs in the system, the
driver looks for the first ``_CST`` object returning at least one valid idle
state description and such that all of the idle states included in its return
package are of the FFH (Functional Fixed Hardware) type, which means that the
``MWAIT`` instruction is expected to be used to tell the processor that it can
enter one of them.  The return package of that ``_CST`` is then assumed to be
applicable to all of the other CPUs in the system and the idle state
descriptions extracted from it are stored in a preliminary list of idle states
coming from the ACPI tables.  [This step is skipped if ``intel_idle`` is
configured to ignore the ACPI tables; see `below <intel-idle-parameters_>`_.]

Next, the first (index 0) entry in the list of available idle states is
initialized to represent a "polling idle state" (a pseudo-idle state in which
the target CPU continuously fetches and executes instructions), and the
subsequent (real) idle state entries are populated as follows.

If the processor model at hand is recognized by ``intel_idle``, there is a
(static) table of idle state descriptions for it in the driver.  In that case,
the "internal" table is the primary source of information on idle states and the
information from it is copied to the final list of available idle states.  If
using the ACPI tables for the enumeration of idle states is not required
(depending on the processor model), all of the listed idle states are enabled by
default (so all of them will be taken into consideration by ``CPUIdle``
governors during CPU idle state selection).  Otherwise, some of the listed idle
states may not be enabled by default if there are no matching entries in the
preliminary list of idle states coming from the ACPI tables.  In that case user
space can still enable them later (on a per-CPU basis) with the help of
the ``disable`` idle state attribute in ``sysfs`` (see
:ref:`idle-states-representation` in
Documentation/admin-guide/pm/cpuidle.rst).  This basically means that
the idle states "known" to the driver may not be enabled by default if they have
not been exposed by the platform firmware (through the ACPI tables).

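A minimal sketch of enabling such a state from user space follows.  It assumes
the per-CPU ``cpuidle/state<i>/disable`` layout described in
Documentation/admin-guide/pm/cpuidle.rst; the helper names and the ``root``
argument are hypothetical, and writing to the real attribute requires root
privileges.

.. code-block:: python

   from pathlib import Path

   SYSFS_CPU_ROOT = "/sys/devices/system/cpu"


   def disable_attr(cpu, index, root=SYSFS_CPU_ROOT):
       """Path of the ``disable`` attribute of idle state <index> on <cpu>."""
       return Path(root) / f"cpu{cpu}" / "cpuidle" / f"state{index}" / "disable"


   def enable_state(cpu, index, root=SYSFS_CPU_ROOT):
       """Enable the idle state on one CPU by writing 0 to ``disable``."""
       disable_attr(cpu, index, root).write_text("0")


   def disable_state(cpu, index, root=SYSFS_CPU_ROOT):
       """Disable the idle state on one CPU by writing 1 to ``disable``."""
       disable_attr(cpu, index, root).write_text("1")

Because the attribute is per-CPU, enabling a state everywhere means repeating
the write for every CPU directory.
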
If the given processor model is not recognized by ``intel_idle``, but it
supports ``MWAIT``, the preliminary list of idle states coming from the ACPI
tables is used for building the final list that will be supplied to the
``CPUIdle`` core during driver registration.  For each idle state in that list,
the description, ``MWAIT`` hint and exit latency are copied to the corresponding
entry in the final list of idle states.  The name of the idle state represented
by it (to be returned by the ``name`` idle state attribute in ``sysfs``) is
"CX_ACPI", where X is the index of that idle state in the final list (note that
the minimum value of X is 1, because 0 is reserved for the "polling" state), and
its target residency is based on the exit latency value.  Specifically, for
C1-type idle states the exit latency value is also used as the target residency
(for compatibility with the majority of the "internal" tables of idle states for
various processor models recognized by ``intel_idle``) and for the other idle
state types (C2 and C3) the target residency value is 3 times the exit latency
(again, that is because it reflects the target residency to exit latency ratio
in the majority of cases for the processor models recognized by ``intel_idle``).
All of the idle states in the final list are enabled by default in this case.


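The naming and target residency rules for idle states taken from the ACPI
tables can be sketched as follows.  The function name, the dictionary layout
and the use of microseconds as the unit are assumptions made for illustration;
only the "CX_ACPI" naming and the 1x/3x rule come from the description above.

.. code-block:: python

   def acpi_state_entry(index, acpi_type, exit_latency_us):
       """Build a final-list entry for an idle state taken from ``_CST``.

       index:     position in the final list (0 is the polling state)
       acpi_type: ACPI C-state type (1, 2 or 3)
       """
       if index < 1:
           raise ValueError("index 0 is reserved for the polling state")
       # C1-type states reuse the exit latency; C2 and C3 use 3 times it.
       multiplier = 1 if acpi_type == 1 else 3
       return {
           "name": f"C{index}_ACPI",
           "exit_latency": exit_latency_us,
           "target_residency": exit_latency_us * multiplier,
       }


   print(acpi_state_entry(2, 3, 50)["target_residency"])  # 150
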
.. _intel-idle-initialization:

Initialization
==============

The initialization of ``intel_idle`` starts with checking if the kernel command
line options forbid the use of the ``MWAIT`` instruction.  If that is the case,
an error code is returned right away.

The next step is to check whether or not the processor model is known to the
driver, which determines the idle states enumeration method (see
`above <intel-idle-enumeration-of-states_>`_), and whether or not the processor
supports ``MWAIT`` (the initialization fails if that is not the case).  Then,
the ``MWAIT`` support in the processor is enumerated through ``CPUID`` and the
driver initialization fails if the level of support is not as expected (for
example, if the total number of ``MWAIT`` substates returned is 0).

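As an illustration of the substate check, the sketch below decodes a value of
the kind ``CPUID`` returns for ``MWAIT`` enumeration.  It assumes the layout
given in the SDM for ``CPUID`` leaf 05H, where ``EDX`` holds one 4-bit substate
count per C-state; the function names are hypothetical and the driver performs
additional checks that are not modeled here.

.. code-block:: python

   def mwait_substate_counts(edx):
       """Return a list whose item n is the substate count of C<n> (n = 0..7).

       Assumption: EDX of CPUID leaf 05H packs one 4-bit count per C-state,
       starting with C0 in bits 3:0.
       """
       return [(edx >> (4 * n)) & 0xF for n in range(8)]


   def has_mwait_substates(edx):
       """False if the total number of substates is 0."""
       return sum(mwait_substate_counts(edx)) > 0


   # 0x1120 is a made-up example value.
   print(mwait_substate_counts(0x1120))  # [0, 2, 1, 1, 0, 0, 0, 0]
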
Next, if the driver is not configured to ignore the ACPI tables (see
`below <intel-idle-parameters_>`_), the idle states information provided by the
platform firmware is extracted from them.

Then, ``CPUIdle`` device objects are allocated for all CPUs and the list of
available idle states is created as explained
`above <intel-idle-enumeration-of-states_>`_.

Finally, ``intel_idle`` is registered with the help of cpuidle_register_driver()
as the ``CPUIdle`` driver for all CPUs in the system and a CPU online callback
for configuring individual CPUs is registered via cpuhp_setup_state(), which
(among other things) causes the callback routine to be invoked for all of the
CPUs present in the system at that time (each CPU executes its own instance of
the callback routine).  That routine registers a ``CPUIdle`` device for the CPU
running it (which enables the ``CPUIdle`` subsystem to operate that CPU) and
optionally performs some CPU-specific initialization actions that may be
required for the given processor model.


.. _intel-idle-parameters:

Kernel Command Line Options and Module Parameters
=================================================

The *x86* architecture support code recognizes three kernel command line
options related to CPU idle time management: ``idle=poll``, ``idle=halt``,
and ``idle=nomwait``.  If any of them is present in the kernel command line, the
``MWAIT`` instruction is not allowed to be used, so the initialization of
``intel_idle`` will fail.

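A quick user-space check for those options could look like the sketch below.
The function name is hypothetical, and the string passed to it is expected to
be the kernel command line (for example, the contents of ``/proc/cmdline``).

.. code-block:: python

   MWAIT_BLOCKING_OPTIONS = {"idle=poll", "idle=halt", "idle=nomwait"}


   def mwait_forbidden(cmdline):
       """Return True if the command line contains an option blocking MWAIT."""
       return any(opt in MWAIT_BLOCKING_OPTIONS for opt in cmdline.split())


   print(mwait_forbidden("ro quiet idle=nomwait"))  # True
   print(mwait_forbidden("ro quiet"))               # False
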
Apart from that there are six module parameters recognized by ``intel_idle``
itself that can be set via the kernel command line (they cannot be updated via
sysfs, so that is the only way to change their values).

The ``max_cstate`` parameter value is the maximum idle state index in the list
of idle states supplied to the ``CPUIdle`` core during the registration of the
driver.  It is also the maximum number of regular (non-polling) idle states that
can be used by ``intel_idle``, so the enumeration of idle states is terminated
after finding that number of usable idle states (the other idle states that
potentially might have been used if ``max_cstate`` had been greater are not
taken into consideration at all).  Setting ``max_cstate`` can prevent
``intel_idle`` from exposing idle states that are regarded as "too deep" for
some reason to the ``CPUIdle`` core, but it does so by making them effectively
invisible until the system is shut down and started again, which may not always
be desirable.  In practice, it is only really necessary to do that if the idle
states in question cannot be enabled during system startup, because in the
working state of the system the CPU power management quality of service (PM
QoS) feature can be used to prevent ``CPUIdle`` from touching those idle states
even if they have been enumerated (see :ref:`cpu-pm-qos` in
Documentation/admin-guide/pm/cpuidle.rst).
Setting ``max_cstate`` to 0 causes the ``intel_idle`` initialization to fail.

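The effect of ``max_cstate`` on the list of idle states can be modeled with the
sketch below.  The function name and the state names in the example are
placeholders rather than entries from any of the driver's tables.  On the
kernel command line the parameter is written with the usual built-in module
prefix, for example ``intel_idle.max_cstate=1``.

.. code-block:: python

   def apply_max_cstate(states, max_cstate):
       """Model the list exposed to CPUIdle for a given ``max_cstate``.

       states: the polling state at index 0 followed by the usable idle
               states in enumeration order.
       Returns None when the driver initialization would fail.
       """
       if max_cstate == 0:
           return None
       # Keep index 0 (polling) plus at most max_cstate regular states.
       return states[: max_cstate + 1]


   print(apply_max_cstate(["POLL", "A", "B", "C"], 2))  # ['POLL', 'A', 'B']
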
The ``no_acpi``, ``use_acpi`` and ``no_native`` module parameters are
recognized by ``intel_idle`` if the kernel has been configured with ACPI
support.  If ACPI is not configured, these flags have no impact on
functionality.

``no_acpi`` - Do not use ACPI at all.  Only native mode is available, no
ACPI mode.

``use_acpi`` - No-op in ACPI mode.  In native mode, the driver will consult
the ACPI tables for the on/off status of C-states.

``no_native`` - Work only in ACPI mode, no native mode available (ignore
all custom tables).

The value of the ``states_off`` module parameter (0 by default) represents a
list of idle states to be disabled by default in the form of a bitmask.

Namely, the positions of the bits that are set in the ``states_off`` value are
the indices of idle states to be disabled by default (as reflected by the names
of the corresponding idle state directories in ``sysfs``, :file:`state0`,
:file:`state1` ... :file:`state<i>` ..., where ``<i>`` is the index of the given
idle state; see :ref:`idle-states-representation` in
Documentation/admin-guide/pm/cpuidle.rst).

For example, if ``states_off`` is equal to 3, the driver will disable idle
states 0 and 1 by default, and if it is equal to 8, idle state 3 will be
disabled by default and so on (bit positions beyond the maximum idle state index
are ignored).

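The bitmask arithmetic can be checked with a short sketch.  Both helper names
are hypothetical, and ``max_index`` stands for the maximum idle state index on
the system at hand.

.. code-block:: python

   def states_off_indices(states_off, max_index):
       """Indices of the idle states disabled by default for a given mask."""
       return [i for i in range(max_index + 1) if states_off & (1 << i)]


   def states_off_mask(indices):
       """Build a ``states_off`` value that disables the given indices."""
       mask = 0
       for i in indices:
           mask |= 1 << i
       return mask


   print(states_off_indices(3, 4))  # [0, 1]
   print(states_off_indices(8, 4))  # [3]
   print(states_off_mask([2, 3]))   # 12
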
The idle states disabled this way can be enabled (on a per-CPU basis) from user
space via ``sysfs``.

The ``ibrs_off`` module parameter is a boolean flag (false by default) that
controls whether IBRS (Indirect Branch Restricted Speculation) is turned off
when the CPU enters an idle state.  This flag does not affect CPUs that use
Enhanced IBRS, which can remain on with little performance impact.

For some CPUs, IBRS will be selected as mitigation for Spectre v2 and Retbleed
security vulnerabilities by default.  Leaving the IBRS mode on while idling may
have a performance impact on the sibling of the idle CPU.  By default, the IBRS
mode is turned off when the CPU enters a deep idle state, but not in some
shallower ones.  Setting the ``ibrs_off`` module parameter forces the IBRS
mode off when the CPU is in any of the available idle states.  This may
help the performance of a sibling CPU at the expense of a slightly higher wakeup
latency for the idle CPU.


.. _intel-idle-core-and-package-idle-states:

Core and Package Levels of Idle States
======================================

Typically, in a processor supporting the ``MWAIT`` instruction there are (at
least) two levels of idle states (or C-states).  One level, referred to as
"core C-states", covers individual cores in the processor, whereas the other
level, referred to as "package C-states", covers the entire processor package
and it may also involve other components of the system (GPUs, memory
controllers, I/O hubs etc.).

Some of the ``MWAIT`` hint values allow the processor to use core C-states only
(most importantly, that is the case for the ``MWAIT`` hint value corresponding
to the ``C1`` idle state), but the majority of them give it a license to put
the target core (i.e. the core containing the logical CPU executing ``MWAIT``
with the given hint value) into a specific core C-state and then (if possible)
to enter a specific package C-state at the deeper level.  For example, the
``MWAIT`` hint value representing the ``C3`` idle state allows the processor to
put the target core into the low-power state referred to as "core ``C3``" (or
``CC3``), which happens if all of the logical CPUs (SMT siblings) in that core
have executed ``MWAIT`` with the ``C3`` hint value (or with a hint value
representing a deeper idle state), and in addition to that (in the majority of
cases) it gives the processor a license to put the entire package (possibly
including some non-CPU components such as a GPU or a memory controller) into the
low-power state referred to as "package ``C3``" (or ``PC3``), which happens if
all of the cores have gone into the ``CC3`` state and (possibly) some additional
conditions are satisfied (for instance, if the GPU is covered by ``PC3``, it may
be required to be in a certain GPU-specific low-power state for ``PC3`` to be
reachable).

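The ``CC3`` and ``PC3`` conditions from the example above can be written down
as a toy model.  It only restates the prose: the integer "depth" values (with
3 standing for ``C3``) and both function names are hypothetical, and real
processors decide this in hardware under model-specific conditions.

.. code-block:: python

   C3_DEPTH = 3  # placeholder depth standing for the C3 hint


   def core_may_enter_cc3(sibling_depths):
       """True if every SMT sibling requested C3 or a deeper idle state."""
       return all(depth >= C3_DEPTH for depth in sibling_depths)


   def package_may_enter_pc3(cores, extra_conditions_met=True):
       """True if all cores are in CC3 and additional conditions hold.

       cores: one list of requested depths per core (one item per sibling).
       """
       return extra_conditions_met and all(
           core_may_enter_cc3(siblings) for siblings in cores
       )


   print(package_may_enter_pc3([[3, 6], [3, 3]]))  # True
   print(package_may_enter_pc3([[3, 1], [3, 3]]))  # False
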
As a rule, there is no simple way to make the processor use core C-states only
if the conditions for entering the corresponding package C-states are met, so
the logical CPU executing ``MWAIT`` with a hint value that is not core-level
only (like for ``C1``) must always assume that this may cause the processor to
enter a package C-state.  [That is why the exit latency and target residency
values corresponding to the majority of ``MWAIT`` hint values in the "internal"
tables of idle states in ``intel_idle`` reflect the properties of package
C-states.]  If using package C-states is not desirable at all, either
:ref:`PM QoS <cpu-pm-qos>` or the ``max_cstate`` module parameter of
``intel_idle`` described `above <intel-idle-parameters_>`_ must be used to
restrict the range of permissible idle states to the ones with core-level only
``MWAIT`` hint values (like ``C1``).


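For the PM QoS route, a minimal sketch is given below.  It assumes the
``/dev/cpu_dma_latency`` interface described in
Documentation/admin-guide/pm/cpuidle.rst, where a process writes a signed
32-bit latency limit (in microseconds) and the request stays in effect only
while the file descriptor is open.  The function name is hypothetical and the
limit value has to be chosen by the caller from the exit latencies of the idle
states on the system at hand.

.. code-block:: python

   import os
   import struct


   def request_cpu_latency_limit(limit_us, path="/dev/cpu_dma_latency"):
       """Write a latency limit and return the open file descriptor.

       The caller must keep the descriptor open for as long as the limit
       is needed and close it afterwards to drop the request.
       """
       fd = os.open(path, os.O_WRONLY)
       try:
           # The value is passed as a binary signed 32-bit integer.
           os.write(fd, struct.pack("i", limit_us))
       except OSError:
           os.close(fd)
           raise
       return fd

Unlike ``max_cstate``, a limit set this way can be dropped at run time simply
by closing the descriptor.
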
References
==========

.. [1] *Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2B*,
       https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-2b-manual.html

.. [2] *Advanced Configuration and Power Interface (ACPI) Specification*,
       https://uefi.org/specifications