.. _numaperf:

=============
NUMA Locality
=============

Some platforms may have multiple types of memory attached to a compute
node. These disparate memory ranges may share some characteristics, such
as CPU cache coherence, but may have different performance. For example,
different media types and buses affect bandwidth and latency.

A system supports such heterogeneous memory by grouping each memory type
under different domains, or "nodes", based on locality and performance
characteristics. Some memory may share the same node as a CPU, and other
memory is provided as memory-only nodes. While memory-only nodes do not
provide CPUs, they may still be local to one or more compute nodes relative
to other nodes. The following diagram shows one such example of two compute
nodes with local memory and a memory-only node for each compute node::

 +------------------+     +------------------+
 | Compute Node 0   +-----+ Compute Node 1   |
 | Local Node0 Mem  |     | Local Node1 Mem  |
 +--------+---------+     +--------+---------+
          |                        |
 +--------+---------+     +--------+---------+
 | Slower Node2 Mem |     | Slower Node3 Mem |
 +------------------+     +------------------+

A "memory initiator" is a node containing one or more devices such as
CPUs or separate memory I/O devices that can initiate memory requests.
A "memory target" is a node containing one or more physical address
ranges accessible from one or more memory initiators.

When multiple memory initiators exist, they may not all have the same
performance when accessing a given memory target. Each initiator-target
pair may be organized into different ranked access classes to represent
this relationship. The highest performing initiator to a given target
is considered to be one of that target's local initiators, and given
the highest access class, 0. Any given target may have one or more
local initiators, and any given initiator may have multiple local
memory targets.

To help applications match memory targets with their initiators, the
kernel provides symlinks from each to the other. The following example
lists the relationship for the access class "0" memory initiators and
targets::

	# symlinks -v /sys/devices/system/node/nodeX/access0/targets/
	relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> ../../nodeY

	# symlinks -v /sys/devices/system/node/nodeY/access0/initiators/
	relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> ../../nodeX
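
As a rough illustration of how an application might consume these links,
the following Python sketch lists the node numbers linked under a node's
access class directory. The helper name ``linked_nodes`` and its ``root``
parameter are assumptions made for this example and are not kernel
interfaces; only the sysfs layout is taken from the text above.

```python
import os
import re

# Location of the node hierarchy described above.
SYSFS_NODE_ROOT = "/sys/devices/system/node"


def linked_nodes(node, relation, access_class=0, root=SYSFS_NODE_ROOT):
    """Return the sorted node IDs linked under nodeN/accessC/<relation>.

    `relation` is "targets" or "initiators". An empty list is returned
    when the directory does not exist, for example when the platform
    does not report this access class. (Hypothetical helper.)
    """
    if relation not in ("targets", "initiators"):
        raise ValueError("relation must be 'targets' or 'initiators'")
    directory = os.path.join(
        root, f"node{node}", f"access{access_class}", relation
    )
    try:
        entries = os.listdir(directory)
    except FileNotFoundError:
        return []
    ids = []
    for name in entries:
        # Keep only the nodeN links; the directory may also hold
        # attribute files that are not links to other nodes.
        match = re.fullmatch(r"node(\d+)", name)
        if match:
            ids.append(int(match.group(1)))
    return sorted(ids)
```

Calling ``linked_nodes(0, "targets")`` would list the access class 0
targets of node 0, and ``linked_nodes(2, "initiators")`` the initiators
linked to node 2. The ``root`` argument exists only so the sketch can be
pointed at a test directory.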

A memory initiator may have multiple memory targets in the same access
class. The target memory's initiators in a given class indicate that the
nodes' access characteristics share the same performance relative to other
linked initiator nodes. The targets within an initiator's access class,
though, do not necessarily perform the same as each other.

The access class "1" is used to allow differentiation between initiators
that are CPUs and hence suitable for generic task scheduling, and
IO initiators such as GPUs and NICs. Unlike access class 0, only
nodes containing CPUs are considered.

================
NUMA Performance
================

Applications may wish to consider which node they want their memory to
be allocated from based on the node's performance characteristics. If
the system provides these attributes, the kernel exports them under the
node sysfs hierarchy by appending the attributes directory under the
memory node's access class 0 initiators as follows::

	/sys/devices/system/node/nodeY/access0/initiators/

These attributes apply only when accessed from nodes that are linked
under this access class's initiators.

The performance characteristics the kernel provides for the local
initiators are exported as follows::

	# tree -P "read*|write*" /sys/devices/system/node/nodeY/access0/initiators/
	/sys/devices/system/node/nodeY/access0/initiators/
	|-- read_bandwidth
	|-- read_latency
	|-- write_bandwidth
	`-- write_latency

The bandwidth attributes are provided in MiB/second.

The latency attributes are provided in nanoseconds.

The values reported here correspond to the rated latency and bandwidth
for the platform.

Access class 1 takes the same form but only includes values for CPU to
memory activity.
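
The following Python sketch shows one way software might read these
attributes and compare memory targets. The function names
``read_access_performance`` and ``highest_read_bandwidth`` and the ``root``
parameter are hypothetical and exist only for this example. The sketch
assumes each attribute file holds a single decimal integer.

```python
import os

SYSFS_NODE_ROOT = "/sys/devices/system/node"
PERF_ATTRIBUTES = (
    "read_bandwidth",   # MiB/second
    "read_latency",     # nanoseconds
    "write_bandwidth",  # MiB/second
    "write_latency",    # nanoseconds
)


def read_access_performance(node, access_class=0, root=SYSFS_NODE_ROOT):
    """Return {attribute: value} for a memory target node.

    Attributes the platform does not export are left out, so the
    result may be empty. (Hypothetical helper.)
    """
    directory = os.path.join(
        root, f"node{node}", f"access{access_class}", "initiators"
    )
    values = {}
    for name in PERF_ATTRIBUTES:
        try:
            with open(os.path.join(directory, name)) as handle:
                values[name] = int(handle.read().strip())
        except (FileNotFoundError, ValueError):
            continue  # not exported, or not a plain integer
    return values


def highest_read_bandwidth(nodes, access_class=0, root=SYSFS_NODE_ROOT):
    """Return the node in `nodes` with the largest read_bandwidth.

    Nodes that do not report read_bandwidth are skipped; None is
    returned when no node reports it.
    """
    best = None
    for node in nodes:
        perf = read_access_performance(node, access_class, root)
        bandwidth = perf.get("read_bandwidth")
        if bandwidth is not None and (best is None or bandwidth > best[1]):
            best = (node, bandwidth)
    return None if best is None else best[0]
```

Because the values are the rated figures for the platform, a comparison
like this only ranks targets as seen from their linked initiators; it
does not measure what a running workload actually achieves.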

==========
NUMA Cache
==========

System memory may be constructed in a hierarchy of elements with various
performance characteristics in order to provide a large address space of
slower performing memory cached by a smaller, higher performing memory. The
system physical addresses that memory initiators are aware of are provided
by the last memory level in the hierarchy. The system meanwhile uses
higher performing memory to transparently cache access to progressively
slower levels.

The term "far memory" is used to denote the last level memory in the
hierarchy. Each increasing cache level provides higher performing
initiator access, and the term "near memory" represents the fastest
cache provided by the system.

This numbering is different from that of CPU caches, where the cache level
(e.g. L1, L2, L3) uses the CPU-side view and each increased level is lower
performing. In contrast, the memory cache level is centric to the last
level memory, so the higher numbered cache level corresponds to memory
nearer to the CPU, and further from far memory.

The memory-side caches are not directly addressable by software. When
software accesses a system address, the system will return it from the
near memory cache if it is present. If it is not present, the system
accesses the next level of memory until there is either a hit in that
cache level, or it reaches far memory.
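
The lookup order described above can be pictured with a small conceptual
model in Python. This is only an illustration of the numbering: the
platform performs the lookup transparently and software cannot observe
it. The function name ``lookup`` and the dictionary-based representation
of cache levels are assumptions made for this sketch.

```python
def lookup(address, cache_levels, far_memory):
    """Model a memory-side cache lookup for one address.

    `cache_levels` maps a cache level index to a dict of the addresses
    that level currently holds. The highest index is near memory, so
    levels are searched from the highest index downwards. `far_memory`
    is a dict backing every system address.

    Returns (value, where), where `where` is the level index that hit
    or the string "far" when the access reached far memory.
    """
    for level in sorted(cache_levels, reverse=True):
        contents = cache_levels[level]
        if address in contents:
            return contents[address], level
    return far_memory[address], "far"
```

With two cache levels, an address present only in level 1 is found after
level 2 misses, and an address in neither level is served by far memory.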

An application does not need to know about caching attributes in order
to use the system. Software may optionally query the memory cache
attributes in order to maximize the performance out of such a setup.
If the system provides a way for the kernel to discover this information,
for example with ACPI HMAT (Heterogeneous Memory Attribute Table),
the kernel will append these attributes to the NUMA node memory target.

When the kernel first registers a memory cache with a node, the kernel
will create the following directory::

	/sys/devices/system/node/nodeX/memory_side_cache/

If that directory is not present, the system either does not provide
a memory-side cache, or that information is not accessible to the kernel.

The attributes for each level of cache are provided under its cache
level index::

	/sys/devices/system/node/nodeX/memory_side_cache/indexA/
	/sys/devices/system/node/nodeX/memory_side_cache/indexB/
	/sys/devices/system/node/nodeX/memory_side_cache/indexC/

Each cache level's directory provides its attributes. For example, the
following shows a single cache level and the attributes available for
software to query::

	# tree /sys/devices/system/node/node0/memory_side_cache/
	/sys/devices/system/node/node0/memory_side_cache/
	|-- index1
	|   |-- indexing
	|   |-- line_size
	|   |-- size
	|   `-- write_policy

The "indexing" will be 0 if it is a direct-mapped cache, and non-zero
for any other index-based, multi-way associativity.

The "line_size" is the number of bytes accessed from the next cache
level on a miss.

The "size" is the number of bytes provided by this cache level.

The "write_policy" will be 0 for write-back, and non-zero for
write-through caching.
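
A minimal Python sketch of querying these attributes follows. The names
``read_memory_side_caches`` and ``_read_int`` and the ``root`` parameter are
hypothetical, and the sketch assumes each attribute file contains one
decimal integer. The ``direct_mapped`` and ``write_back`` keys are
conveniences derived in the example from the 0 / non-zero rules above;
they are not sysfs attributes.

```python
import os
import re

SYSFS_NODE_ROOT = "/sys/devices/system/node"
CACHE_ATTRIBUTES = ("indexing", "line_size", "size", "write_policy")


def _read_int(path):
    """Return the integer stored in `path`, or None if unavailable."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (FileNotFoundError, ValueError):
        return None


def read_memory_side_caches(node, root=SYSFS_NODE_ROOT):
    """Return {cache level index: attributes} for one node.

    An empty dict means the memory_side_cache directory is absent:
    either there is no memory-side cache or the kernel cannot
    discover it. (Hypothetical helper.)
    """
    base = os.path.join(root, f"node{node}", "memory_side_cache")
    try:
        entries = os.listdir(base)
    except FileNotFoundError:
        return {}
    caches = {}
    for entry in entries:
        match = re.fullmatch(r"index(\d+)", entry)
        if not match:
            continue
        attrs = {
            name: _read_int(os.path.join(base, entry, name))
            for name in CACHE_ATTRIBUTES
        }
        indexing = attrs["indexing"]
        policy = attrs["write_policy"]
        # Derived flags; left as None when the attribute is missing.
        attrs["direct_mapped"] = None if indexing is None else indexing == 0
        attrs["write_back"] = None if policy is None else policy == 0
        caches[int(match.group(1))] = attrs
    return caches
```

The returned dictionary is keyed by the numeric part of each ``indexN``
directory, so the largest key corresponds to the cache level nearest
to the CPU.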

========
See Also
========

[1] https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
- Section 5.2.27