.. _numaperf:

=============
NUMA Locality
=============

Some platforms may have multiple types of memory attached to a compute
node. These disparate memory ranges may share some characteristics, such
as CPU cache coherence, but may have different performance. For example,
different media types and buses affect bandwidth and latency.

A system supports such heterogeneous memory by grouping each memory type
under different domains, or "nodes", based on locality and performance
characteristics. Some memory may share the same node as a CPU, and other
memory is provided as memory-only nodes. While memory-only nodes do not
provide CPUs, they may still be local to one or more compute nodes relative
to other nodes. The following diagram shows one such example of two compute
nodes with local memory and a memory-only node for each compute node::

 +------------------+     +------------------+
 | Compute Node 0   +-----+ Compute Node 1   |
 | Local Node0 Mem  |     | Local Node1 Mem  |
 +--------+---------+     +--------+---------+
          |                        |
 +--------+---------+     +--------+---------+
 | Slower Node2 Mem |     | Slower Node3 Mem |
 +------------------+     +--------+---------+

A "memory initiator" is a node containing one or more devices such as
CPUs or separate memory I/O devices that can initiate memory requests.
A "memory target" is a node containing one or more physical address
ranges accessible from one or more memory initiators.

When multiple memory initiators exist, they may not all have the same
performance when accessing a given memory target. Each initiator-target
pair may be organized into different ranked access classes to represent
this relationship. The highest performing initiator to a given target
is considered to be one of that target's local initiators, and is given
the highest access class, 0. Any given target may have one or more
local initiators, and any given initiator may have multiple local
memory targets.

To help applications match memory targets with their initiators, the
kernel provides symlinks between them. The following example lists the
relationship for the access class "0" memory initiators and targets::

    # symlinks -v /sys/devices/system/node/nodeX/access0/targets/
    relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> ../../nodeY

    # symlinks -v /sys/devices/system/node/nodeY/access0/initiators/
    relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> ../../nodeX

A memory initiator may have multiple memory targets in the same access
class. The target memory's initiators in a given class indicate that the
nodes' access characteristics share the same performance relative to other
linked initiator nodes.
The targets within an initiator's access class, though, do not
necessarily perform the same as each other.

The access class "1" is used to allow differentiation between initiators
that are CPUs and hence suitable for generic task scheduling, and
IO initiators such as GPUs and NICs. Unlike access class 0, only
nodes containing CPUs are considered.

================
NUMA Performance
================

Applications may wish to consider which node they want their memory to
be allocated from based on the node's performance characteristics. If
the system provides these attributes, the kernel exports them under the
node sysfs hierarchy by appending the attributes directory under the
memory node's access class 0 initiators as follows::

    /sys/devices/system/node/nodeY/access0/initiators/

These attributes apply only when accessed from nodes that are linked
under this access class's initiators.
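
Because the attributes apply only to linked nodes, it can help to list which
nodes are linked under a given access class. The following is a minimal
Python sketch, assuming the sysfs layout shown above. The helper name
``linked_nodes`` and its ``root`` parameter are hypothetical and exist only
for this illustration; ``root`` allows the function to be pointed at a test
directory instead of the real sysfs tree.

```python
import os
import re

# Assumed location of the node hierarchy, taken from the paths shown above.
SYSFS_NODE_ROOT = "/sys/devices/system/node"


def linked_nodes(node, relation, access_class=0, root=SYSFS_NODE_ROOT):
    """Return the sorted node ids linked under nodeN/accessC/<relation>/.

    `relation` is "initiators" or "targets". An empty list is returned
    when the directory is absent, for example when the platform does not
    describe that access class.
    """
    if relation not in ("initiators", "targets"):
        raise ValueError("relation must be 'initiators' or 'targets'")
    path = os.path.join(root, f"node{node}", f"access{access_class}", relation)
    try:
        entries = os.listdir(path)
    except FileNotFoundError:
        return []
    ids = []
    for name in entries:
        # Keep only the nodeN links; skip attribute files in the same
        # directory (such as read_bandwidth).
        match = re.fullmatch(r"node(\d+)", name)
        if match:
            ids.append(int(match.group(1)))
    return sorted(ids)
```

On a system that exposes this hierarchy, ``linked_nodes(2, "initiators")``
would list the access class 0 initiators of node 2, and
``linked_nodes(0, "targets")`` the access class 0 targets of node 0.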

The performance characteristics the kernel provides for the local initiators
are exported as follows::

    # tree -P "read*|write*" /sys/devices/system/node/nodeY/access0/initiators/
    /sys/devices/system/node/nodeY/access0/initiators/
    |-- read_bandwidth
    |-- read_latency
    |-- write_bandwidth
    `-- write_latency

The bandwidth attributes are provided in MiB/second.

The latency attributes are provided in nanoseconds.

The values reported here correspond to the rated latency and bandwidth
for the platform.

Access class 1 takes the same form but only includes values for CPU to
memory activity.

==========
NUMA Cache
==========

System memory may be constructed in a hierarchy of elements with various
performance characteristics in order to provide a large address space of
slower performing memory cached by a smaller, higher performing memory. The
system physical addresses that memory initiators are aware of are provided
by the last memory level in the hierarchy. The system meanwhile uses
higher performing memory to transparently cache access to progressively
slower levels.

The term "far memory" is used to denote the last level memory in the
hierarchy.
Each increasing cache level provides higher performing
initiator access, and the term "near memory" represents the fastest
cache provided by the system.

This numbering differs from that of CPU caches, where the cache level
(e.g. L1, L2, L3) uses the CPU-side view and each increased level is lower
performing. In contrast, the memory cache level is centric to the last
level memory, so the higher numbered cache level corresponds to memory
nearer to the CPU, and further from far memory.

The memory-side caches are not directly addressable by software. When
software accesses a system address, the system will return it from the
near memory cache if it is present. If it is not present, the system
accesses the next level of memory until there is either a hit in that
cache level, or it reaches far memory.

An application does not need to know about caching attributes in order
to use the system. Software may optionally query the memory cache
attributes in order to maximize the performance out of such a setup.
If the system provides a way for the kernel to discover this information,
for example with ACPI HMAT (Heterogeneous Memory Attribute Table),
the kernel will append these attributes to the NUMA node memory target.
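
The lookup order described above (nearest cache level first, far memory
last) can be illustrated with a toy model. This is only a sketch of the
ordering: the real lookup is done transparently by the platform and is not
visible to software. The function name ``serving_level``, the dictionary
representation, and the use of ``None`` for far memory are all assumptions
made for the example.

```python
def serving_level(address, cache_levels):
    """Return the memory-side cache level index that serves `address`.

    `cache_levels` maps a cache level index to the set of addresses that
    level currently holds. `None` is returned when no level holds the
    address, meaning the access reaches far memory.
    """
    # A higher index is nearer to the initiator, so search the levels in
    # descending index order and stop at the first hit.
    for index in sorted(cache_levels, reverse=True):
        if address in cache_levels[index]:
            return index
    return None
```

For example, with ``{1: {0x1000, 0x2000}, 2: {0x1000}}`` an access to
``0x1000`` is served by level 2 (near memory), an access to ``0x2000`` by
level 1, and any other address falls through to far memory.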

When the kernel first registers a memory cache with a node, the kernel
will create the following directory::

    /sys/devices/system/node/nodeX/memory_side_cache/

If that directory is not present, the system either does not provide
a memory-side cache, or that information is not accessible to the kernel.

The attributes for each level of cache are provided under its cache
level index::

    /sys/devices/system/node/nodeX/memory_side_cache/indexA/
    /sys/devices/system/node/nodeX/memory_side_cache/indexB/
    /sys/devices/system/node/nodeX/memory_side_cache/indexC/

Each cache level's directory provides its attributes. For example, the
following shows a single cache level and the attributes available for
software to query::

    # tree /sys/devices/system/node/node0/memory_side_cache/
    /sys/devices/system/node/node0/memory_side_cache/
    |-- index1
    |   |-- indexing
    |   |-- line_size
    |   |-- size
    |   `-- write_policy

The "indexing" will be 0 if it is a direct-mapped cache, and non-zero
for any other index-based, multi-way associativity.

The "line_size" is the number of bytes accessed from the next cache
level on a miss.

The "size" is the number of bytes provided by this cache level.
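
The directory tree shown above can be walked from user space. The sketch
below collects the four attribute files for every ``indexN`` directory of a
node. It assumes each file holds a single decimal integer; the helper name
``read_memory_side_cache`` and its ``root`` parameter are hypothetical and
only serve this illustration.

```python
import os
import re

# Attribute file names as listed in the tree output above.
CACHE_ATTRIBUTES = ("indexing", "line_size", "size", "write_policy")


def read_memory_side_cache(node, root="/sys/devices/system/node"):
    """Return {cache_level_index: {attribute_name: value}} for one node.

    An empty dict is returned when memory_side_cache/ is absent, which
    means no memory-side cache is known to the kernel for this node.
    """
    base = os.path.join(root, f"node{node}", "memory_side_cache")
    if not os.path.isdir(base):
        return {}
    levels = {}
    for entry in os.listdir(base):
        match = re.fullmatch(r"index(\d+)", entry)
        if not match:
            continue
        attrs = {}
        for name in CACHE_ATTRIBUTES:
            try:
                with open(os.path.join(base, entry, name)) as f:
                    # Assumption: the file holds one decimal integer.
                    attrs[name] = int(f.read().strip())
            except (OSError, ValueError):
                # Skip attributes that are missing or not plain integers.
                continue
        levels[int(match.group(1))] = attrs
    return levels
```

A caller could, for instance, check ``levels[1]["indexing"] == 0`` to see
whether cache level 1 is direct-mapped.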

The "write_policy" will be 0 for write-back, and non-zero for
write-through caching.

========
See Also
========

[1] https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
- Section 5.2.27