.. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.2-no-invariants-or-later

=============
Scrub Control
=============

Copyright (c) 2024-2025 HiSilicon Limited.

:Author:   Shiju Jose <shiju.jose@huawei.com>
:License:  The GNU Free Documentation License, Version 1.2 without
           Invariant Sections, Front-Cover Texts nor Back-Cover Texts.
           (dual licensed under the GPL v2)

- Written for: 6.15

Introduction
------------

Increasing DRAM size and cost have made memory subsystem reliability an
important concern. These modules are used where potentially corrupted data
could cause expensive or fatal issues. Memory errors are among the top
hardware failures that cause server and workload crashes.

Memory scrubbing is a feature where an ECC (Error-Correcting Code) engine
reads data from each memory media location, corrects it if necessary, and
writes the corrected data back to the same memory media location.

DIMMs can be scrubbed at a configurable rate to detect uncorrected memory
errors and attempt recovery from detected errors, providing the following
benefits:

1. Proactively scrubbing DIMMs reduces the chance of a correctable error
   becoming uncorrectable.

2. When detected, uncorrected errors caught in unallocated memory pages are
   isolated and prevented from being allocated to an application or the OS.

3. This reduces the likelihood of software or hardware products encountering
   memory errors.

4. The additional data on failures in memory may be used to build up
   statistics that are later used to decide whether to use memory repair
   technologies such as Post Package Repair or Sparing.

There are two types of memory scrubbing:

1. Background (patrol) scrubbing while the DRAM is otherwise idle.

2. On-demand scrubbing for a specific address range or region of memory.

Several types of interfaces to hardware memory scrubbers have been
identified, such as CXL memory device patrol scrub, CXL DDR5 ECS, ACPI
RAS2 memory scrubbing, and ACPI NVDIMM ARS (Address Range Scrub).

The control mechanisms vary across different memory scrubbers. To enable
standardized userspace tooling, there is a need to present these controls
through a standardized ABI.

A generic memory EDAC scrub control allows users to manage underlying
scrubbers in the system through a standardized sysfs control interface.  It
abstracts the management of various scrubbing functionalities into a unified
set of functions.

Use cases of common scrub control feature
-----------------------------------------

1. Several types of interfaces for hardware memory scrubbers have been
   identified, including the CXL memory device patrol scrub, CXL DDR5 ECS,
   ACPI RAS2 memory scrubbing features, ACPI NVDIMM ARS (Address Range Scrub),
   and software-based memory scrubbers.

   Of the identified interfaces to hardware memory scrubbers, some support
   control over patrol (background) scrubbing (e.g., ACPI RAS2, CXL) and/or
   on-demand scrubbing (e.g., ACPI RAS2, ACPI ARS). However, the scrub control
   interfaces vary between memory scrubbers, highlighting the need for
   a standardized, generic sysfs scrub control interface that is accessible to
   userspace for administration and use by scripts/tools.

2. User-space scrub controls allow users to disable scrubbing if necessary,
   for example, to disable background patrol scrubbing or adjust the scrub
   rate for performance-aware operations where background activities need to
   be minimized or disabled.

3. User-space tools enable on-demand scrubbing for specific address ranges,
   provided that the scrubber supports this functionality.

4. User-space tools can also control memory DIMM scrubbing at a configurable
   scrub rate via sysfs scrub controls. This approach offers several benefits:

   4.1. Detects uncorrectable memory errors early, before user access to affected
        memory, helping facilitate recovery.

   4.2. Reduces the likelihood of correctable errors developing into uncorrectable
        errors.

5. Policy control for hotplugged memory is necessary because there may not
   be a system-wide BIOS or similar control to manage scrub settings for a CXL
   device added after boot. Determining these settings is a policy decision,
   balancing reliability against performance, so userspace should control it.
   Therefore, a unified interface is recommended for handling this function in
   a way that aligns with other similar interfaces, rather than creating a
   separate one.

Scrubbing features
------------------

CXL Memory Scrubbing features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

CXL spec r3.1 [1]_ section 8.2.9.9.11.1 describes the memory device patrol
scrub control feature. The device patrol scrub proactively locates and
corrects errors on a regular cycle. The patrol scrub control allows userspace
to request changes to the CXL patrol scrubber's configuration.

The patrol scrub control allows the requester to specify the number of
hours in which a patrol scrub cycle must be completed, provided that
the requested scrub rate is within the range of scrub rates that the
device supports. In the CXL driver, the number of seconds per scrub
cycle, which the user requests via sysfs, is rescaled to hours per scrub
cycle.

In addition, it allows the host to disable the feature in case it interferes
with performance-aware operations which require the background operations to
be turned off.
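
The rescaling from seconds to hours described above can be sketched in a few
lines. This is an illustration only, not the CXL driver's code: the function
name, the decision to round up to whole hours, and the decision to reject
(rather than clamp) out-of-range requests are assumptions made for this
example.

```python
def scrub_cycle_secs_to_hours(cycle_secs: int, min_hours: int, max_hours: int) -> int:
    """Rescale a requested scrub cycle from seconds to whole hours.

    Assumptions of this sketch (not taken from the driver):
    - partial hours are rounded up;
    - requests outside [min_hours, max_hours] are rejected, not clamped.
    """
    if cycle_secs <= 0:
        raise ValueError("scrub cycle must be a positive number of seconds")

    # Integer ceiling division: 3600 s -> 1 h, 3601 s -> 2 h.
    hours = (cycle_secs + 3599) // 3600

    if not min_hours <= hours <= max_hours:
        raise ValueError(
            f"{hours} h is outside the supported range "
            f"[{min_hours}, {max_hours}] h"
        )
    return hours
```

With a hypothetical supported range of 1 to 255 hours, a request of 43200
seconds maps to a 12-hour scrub cycle, while a request of 3601 seconds maps
to 2 hours under the round-up assumption.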

Error Check Scrub (ECS)
~~~~~~~~~~~~~~~~~~~~~~~

CXL spec r3.1 [1]_ section 8.2.9.9.11.2 describes Error Check Scrub (ECS),
a feature defined in the JEDEC DDR5 SDRAM Specification (JESD79-5) that
allows DRAM to internally read data, correct single-bit errors, and write the
corrected data bits back to the DRAM array while providing transparency to
error counts.

The DDR5 device contains a number of memory media Field Replaceable Units
(FRUs) per device. The DDR5 ECS feature, and thus the ECS control driver,
supports configuring the ECS parameters per FRU.

ACPI RAS2 Hardware-based Memory Scrubbing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ACPI spec 6.5 [2]_ section 5.2.21 ACPI RAS2 describes an ACPI RAS2 table
which provides interfaces for platform RAS features and supports independent
RAS controls and capabilities for a given RAS feature for multiple instances
of the same component in a given system.

Memory RAS features apply to RAS capabilities, controls and operations that
are specific to memory. RAS2 PCC sub-spaces for memory-specific RAS features
have a Feature Type of 0x00 (Memory).

The platform can use the hardware-based memory scrubbing feature to expose
controls and capabilities associated with hardware-based memory scrub
engines. As per the specification, the RAS2 memory scrubbing feature supports:

1. Independent memory scrubbing controls for each NUMA domain, identified
   using its proximity domain.

2. Provision for background (patrol) scrubbing of the entire memory system,
   as well as on-demand scrubbing for a specific region of memory.

ACPI Address Range Scrubbing (ARS)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ACPI spec 6.5 [2]_ section 9.19.7.2 describes Address Range Scrubbing (ARS).
ARS allows the platform to communicate memory errors to system software.
This capability allows system software to prevent accesses to addresses with
uncorrectable errors in memory. ARS functions manage all NVDIMMs present in
the system. Only one scrub can be in progress system-wide at any given time.

The following functions are supported as per the specification:

1. Query ARS Capabilities for a given address range, which also indicates
   whether the platform supports the ACPI NVDIMM Root Device Unconsumed Error
   Notification.

2. Start ARS triggers an Address Range Scrub for the given memory range.
   Address scrubbing can be done for volatile or persistent memory, or both.

3. Query ARS Status command allows software to get the status of ARS,
   including the progress of ARS and the ARS error record.

4. Clear Uncorrectable Error.

5. Translate SPA.

6. ARS Error Inject, etc.

The kernel already provides a control for ARS; ARS is currently not
supported in EDAC.

.. [1] https://computeexpresslink.org/cxl-specification/

.. [2] https://uefi.org/specs/ACPI/6.5/

Comparison of various scrubbing features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 +--------------+-----------+-----------+-----------+-----------+
 |              |   ACPI    | CXL patrol|  CXL ECS  |  ARS      |
 |  Name        |   RAS2    | scrub     |           |           |
 +--------------+-----------+-----------+-----------+-----------+
 |              |           |           |           |           |
 | On-demand    | Supported | No        | No        | Supported |
 | Scrubbing    |           |           |           |           |
 |              |           |           |           |           |
 +--------------+-----------+-----------+-----------+-----------+
 |              |           |           |           |           |
 | Background   | Supported | Supported | Supported | No        |
 | scrubbing    |           |           |           |           |
 |              |           |           |           |           |
 +--------------+-----------+-----------+-----------+-----------+
 |              |           |           |           |           |
 | Mode of      | Scrub ctrl| per device| per memory|  Unknown  |
 | scrubbing    | per NUMA  |           | media     |           |
 |              | domain.   |           |           |           |
 +--------------+-----------+-----------+-----------+-----------+
 |              |           |           |           |           |
 | Query scrub  | Supported | Supported | Supported | Supported |
 | capabilities |           |           |           |           |
 |              |           |           |           |           |
 +--------------+-----------+-----------+-----------+-----------+
 |              |           |           |           |           |
 | Setting      | Supported | No        | No        | Supported |
 | address range|           |           |           |           |
 |              |           |           |           |           |
 +--------------+-----------+-----------+-----------+-----------+
 |              |           |           |           |           |
 | Setting      | Supported | Supported | No        | No        |
 | scrub rate   |           |           |           |           |
 |              |           |           |           |           |
 +--------------+-----------+-----------+-----------+-----------+
 |              |           |           |           |           |
 | Unit for     | Not       | in hours  | No        | No        |
 | scrub rate   | Defined   |           |           |           |
 |              |           |           |           |           |
 +--------------+-----------+-----------+-----------+-----------+
 |              | Supported |           |           |           |
 | Scrub        | on-demand | No        | No        | Supported |
 | status/      | scrubbing |           |           |           |
 | Completion   | only      |           |           |           |
 +--------------+-----------+-----------+-----------+-----------+
 | UC error     |           |CXL general|CXL general| ACPI UCE  |
 | reporting    | Exception |media/DRAM |media/DRAM | notify and|
 |              |           |event/media|event/media| query     |
 |              |           |scan?      |scan?      | ARS status|
 +--------------+-----------+-----------+-----------+-----------+
 |              |           |           |           |           |
 | Support for  | Supported | Supported | Supported | No        |
 | EDAC control |           |           |           |           |
 |              |           |           |           |           |
 +--------------+-----------+-----------+-----------+-----------+
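
For tooling that needs to reason about these differences, the yes/no rows of
the comparison table can be captured as a small lookup structure. The sketch
below transcribes only the boolean rows; the dictionary layout, key names and
helper function are choices made for this example and are not part of any
kernel interface.

```python
# Boolean rows of the comparison table, keyed by scrubber name.
# The key names ("on_demand", "background", ...) are invented for this sketch.
SCRUB_FEATURES = {
    "ACPI RAS2": {
        "on_demand": True,
        "background": True,
        "set_address_range": True,
        "set_scrub_rate": True,
        "edac_control": True,
    },
    "CXL patrol scrub": {
        "on_demand": False,
        "background": True,
        "set_address_range": False,
        "set_scrub_rate": True,
        "edac_control": True,
    },
    "CXL ECS": {
        "on_demand": False,
        "background": True,
        "set_address_range": False,
        "set_scrub_rate": False,
        "edac_control": True,
    },
    "ARS": {
        "on_demand": True,
        "background": False,
        "set_address_range": True,
        "set_scrub_rate": False,
        "edac_control": False,
    },
}


def scrubbers_supporting(feature: str) -> list[str]:
    """Return the scrubber names whose table entry for `feature` is 'Supported'."""
    return [name for name, caps in SCRUB_FEATURES.items() if caps[feature]]
```

For instance, ``scrubbers_supporting("on_demand")`` yields ACPI RAS2 and ARS,
matching the "On-demand Scrubbing" row above.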

The File System
---------------

The control attributes of a registered scrubber instance can be
accessed in:

/sys/bus/edac/devices/<dev-name>/scrubX/
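
As an illustration of the layout above, the following sketch walks the EDAC
device directory and collects whatever attribute files each ``scrubX``
directory exposes. It deliberately does not assume any attribute names; the
function name and the returned dictionary shape are choices made for this
example, and the ``root`` parameter exists so the code can be tried against a
test directory.

```python
from pathlib import Path


def list_scrub_instances(root: str = "/sys/bus/edac/devices") -> dict:
    """Map "<dev-name>/scrubX" to a dict of {attribute file name: contents}.

    Attributes that cannot be read (for example, write-only files or
    insufficient permissions) are recorded with a value of None.
    """
    instances = {}
    for scrub_dir in sorted(Path(root).glob("*/scrub[0-9]*")):
        if not scrub_dir.is_dir():
            continue
        attrs = {}
        for entry in sorted(scrub_dir.iterdir()):
            # Skip subdirectories; only plain attribute files are collected.
            if not entry.is_file():
                continue
            try:
                attrs[entry.name] = entry.read_text().strip()
            except OSError:
                attrs[entry.name] = None
        instances[f"{scrub_dir.parent.name}/{scrub_dir.name}"] = attrs
    return instances


if __name__ == "__main__":
    for name, attrs in list_scrub_instances().items():
        print(name, attrs)
```

On a system without registered scrubbers the function simply returns an empty
dictionary. Note that a real sysfs directory may also contain generic files
that are not scrub controls; this sketch does not filter them out.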

sysfs
-----

Sysfs files are documented in
`Documentation/ABI/testing/sysfs-edac-scrub` and
`Documentation/ABI/testing/sysfs-edac-ecs`.
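
A userspace tool would typically change a scrub setting by writing a value to
one of these sysfs files. The helper below is a minimal sketch of that step.
The function name is invented for this example, and the attribute and device
names shown in the usage comment are assumptions that should be checked
against the ABI files listed above before use.

```python
from pathlib import Path


def write_scrub_attr(scrub_dir: str, attr: str, value) -> None:
    """Write `value` to the attribute file `attr` inside a scrubX directory.

    Raises FileNotFoundError if the scrubber does not expose the attribute,
    since not every scrubber supports every control.
    """
    path = Path(scrub_dir) / attr
    if not path.is_file():
        raise FileNotFoundError(f"{path} is not exposed by this scrubber")
    path.write_text(f"{value}\n")


# Hypothetical usage (device and attribute names are assumptions; writing
# to sysfs normally requires root privileges):
#
#   scrub = "/sys/bus/edac/devices/<dev-name>/scrub0"
#   write_scrub_attr(scrub, "current_cycle_duration", 43200)
#   write_scrub_attr(scrub, "enable_background", 1)
```

Checking for the file before writing keeps the tool from silently creating
nothing or failing with a less specific error when a given scrubber lacks a
control.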