
SNC and membw

Roman Storozhenko edited this page Mar 11, 2024 · 5 revisions

membw usage scenario

Typically, to populate a cache, one just runs: ./membw -c <core_number> -b <MB/s> --write

The tool allocates a chunk of memory twice the size of the socket's physical L3 cache, which is enough to populate 100% of the cache.

membw limitations in presence of SNC

The situation changes when SNC is enabled. The membw tool uses the standard C library facilities for memory allocation, so it is subject to the OS memory placement policy, including NUMA awareness. A NUMA-aware OS allocates only addresses local to the SNC domain, so only the cache slices local to that SNC domain are populated. With SNC-2 this is 50% of the cache slices, with SNC-3 about 33%, and with SNC-4 only 25%. The conclusion is that a workload that needs 100% socket cache occupancy must be distributed across all the SNC domains on the socket, using at least one CPU core from each domain.
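The percentages above follow from dividing the socket cache evenly among the SNC domains. A minimal sketch of that arithmetic, assuming an even split; the function name `snc_populated_percent` is hypothetical and not part of membw:

```shell
# Hypothetical helper: share of the socket L3 (in percent, rounded down)
# that one membw instance can populate when the socket has N SNC domains.
snc_populated_percent() {
    domains=$1
    echo $((100 / domains))
}

for n in 2 3 4; do
    echo "SNC-$n: $(snc_populated_percent "$n")% of the socket L3 per domain"
done
```

Integer division is used, so SNC-3 is reported as 33 rather than 33.3.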

Example of membw usage for 100% cache population in presence of SNC-2

System configuration excerpts obtained with lscpu:

CPU(s):                  256
  On-line CPU(s) list:   0-255
Vendor ID:               GenuineIntel
    Thread(s) per core:  2
    Core(s) per socket:  64
    Socket(s):           2
Caches (sum of all):
  L1d:                   6 MiB (128 instances)
  L1i:                   4 MiB (128 instances)
  L2:                    256 MiB (128 instances)
  L3:                    640 MiB (2 instances)
NUMA:
  NUMA node(s):          4
  NUMA node0 CPU(s):     0-31,128-159
  NUMA node1 CPU(s):     32-63,160-191
  NUMA node2 CPU(s):     64-95,192-223
  NUMA node3 CPU(s):     96-127,224-255

The test system is a 2-socket system with 256 logical CPUs in total, or 128 per socket. Let's consider socket 0. It contains 2 NUMA nodes, 0 and 1, that is, SNC domains 0 and 1. SNC domain 0 has 2 ranges of CPUs: 0-31 and 128-159. SNC domain 1 has 2 ranges of CPUs: 32-63 and 160-191. The system L3 cache size is 640 MiB: 320 MiB per socket, 160 MiB per SNC domain. To reach 100% cache occupancy on physical socket 0 we have to choose one core from each of SNC domains 0 and 1 and then run membw against each of them. Let's choose core 4 from SNC domain 0 and core 32 from SNC domain 1 (note that binding memory allocation to both SNC domains 0 and 1 does not make a single membw instance populate more than 160 MiB, because addresses are still allocated on the local SNC domain):

cache-qos-source/tools/membw$ numactl --membind=0,1 ./membw -c 4 -b 2400 --write

cache-qos-source/tools/membw$ numactl --membind=0,1 ./membw -c 32 -b 2400 --write
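The core selection can also be scripted. The sketch below is a dry run that only prints the commands; it assumes the per-node CPU lists from the lscpu excerpt and the same numactl and membw options as above. The helper name `first_cpu` is hypothetical, and it picks the first core of each domain (0 and 32) instead of 4 and 32, which works equally well because any core of the domain will do.

```shell
# Hypothetical helper: return the first CPU of a list such as "32-63,160-191".
first_cpu() {
    first_range=${1%%,*}          # keep the first range, e.g. "32-63"
    echo "${first_range%%-*}"     # keep the start of that range, e.g. "32"
}

# CPU lists of NUMA nodes 0 and 1 (socket 0), copied from the lscpu excerpt.
for cpulist in "0-31,128-159" "32-63,160-191"; do
    cpu=$(first_cpu "$cpulist")
    # Dry run: print the command instead of starting membw.
    echo "numactl --membind=0,1 ./membw -c $cpu -b 2400 --write"
done
```

On a live system the same lists can be read from /sys/devices/system/node/node<N>/cpulist instead of being hard-coded.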

Measure the cache occupancy on the cores under load:

$ LD_LIBRARY_PATH=lib cache-qos-source/pqos/pqos --iface=msr -r snc-total -m llc:4,32

The command will show 160 MiB of cache occupancy for each of the cores, 4 and 32.

Summing up the values gives us ~100% cache occupancy in SNC-total mode (the socket cache size is 320 MiB on this machine).
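The summation can be checked with a few lines of shell. This is a sketch using the figures quoted in the text (320 MiB socket cache, 160 MiB per core); the function name `occupancy_percent` is hypothetical, and real pqos readings would have to be converted to MiB first.

```shell
# Hypothetical helper: first argument is the socket L3 size in MiB,
# the remaining arguments are per-core occupancy readings in MiB.
occupancy_percent() {
    socket_mib=$1
    shift
    sum=0
    for value in "$@"; do
        sum=$((sum + value))
    done
    echo $((sum * 100 / socket_mib))
}

# Cores 4 and 32 at 160 MiB each on a 320 MiB socket.
occupancy_percent 320 160 160
```

With only one of the two cores under load, the same calculation gives 50, which matches the SNC-2 limitation described earlier.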
