Bug 12286 - [devel-6.6] Intel RDT monitoring support on Sub-NUMA Cluster (SNC) enabled systems
Status: NEW
Alias: None
Product: ANCK 6.6 Dev
Classification: ANCK
Component: X86
Version: unspecified
Hardware: x86_64 Linux
Importance: P3-Medium S3-normal
Target Milestone: ---
Assignee: Guanjun
QA Contact: shuming
Reported: 2024-12-10 13:24 UTC by zhiquan1-li
Modified: 2024-12-10 13:24 UTC
Description zhiquan1-li intel_group 2024-12-10 13:24:58 UTC
Description of problem:

About Intel RDT with Sub-NUMA Cluster (SNC) enabled

Supported Intel platforms: ICX, SPR, EMR, GNR, etc.

The Sub-NUMA Cluster (SNC) feature on some Intel processors partitions the
CPUs that share an L3 cache into two or more sets. This plays havoc with
the Resource Director Technology (RDT) monitoring features. Prior to this
patch, Intel advised that SNC and RDT are incompatible.

Some of these CPUs support an MSR that can partition the RMID counters in
the same way. This allows monitoring features to be used. Legacy monitoring
files provide the sum of counters from each SNC node for backwards
compatibility. Additional files per SNC node provide details per node.

With Sub-NUMA Cluster (SNC) mode enabled the scope of monitoring resources
is per-NODE instead of per-L3 cache. Backwards compatibility is maintained
by providing files in the mon_L3_XX directories that sum event counts
for all SNC nodes sharing an L3 cache.

The top-level monitoring files in each "mon_L3_XX" directory provide the
sum of data across all SNC nodes sharing an L3 cache instance.
Users who bind tasks to the CPUs of a specific Sub-NUMA node can read the
"llc_occupancy", "mbm_total_bytes", and "mbm_local_bytes" in the
"mon_sub_L3_YY" directories to get node local data.
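
As a rough sketch of the binding step described above, the following Python
parses a node's CPU list from sysfs and pins a process to those CPUs. It
assumes that the SNC node id YY in "mon_sub_L3_YY" is the same number as the
NUMA node under /sys/devices/system/node, which this report does not state
explicitly. parse_cpulist and bind_to_node are hypothetical helper names.

```python
import os


def parse_cpulist(text):
    """Expand a sysfs cpulist string such as "0-3,8" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty list or stray comma
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def bind_to_node(node_id, pid=0):
    """Pin a process (0 = the caller) to the CPUs of one NUMA node.

    Assumption: node_id is the same number as YY in mon_sub_L3_YY.
    """
    path = f"/sys/devices/system/node/node{node_id}/cpulist"
    with open(path) as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(pid, cpus)  # Linux-only call
    return cpus
```

After bind_to_node(1), for example, the node-local counters for the task's
monitoring group would be read from the matching "mon_sub_L3_01" directory.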

For a system with two SNC nodes per L3, the monitor reporting files look like this:

# cd /sys/fs/resctrl/mon_data
# tree mon_L3_00
mon_L3_00                  <- 00 here is L3 cache id
├── llc_occupancy              \  These files provide legacy support
├── mbm_local_bytes             > for non-SNC aware monitor apps
├── mbm_total_bytes            /  that expect data at L3 cache level
├── mon_sub_L3_00          <- 00 here is SNC node id
│   ├── llc_occupancy          \  These files are finer grained
│   ├── mbm_local_bytes         > data from each SNC node
│   └── mbm_total_bytes        /
└── mon_sub_L3_01
    ├── llc_occupancy          \
    ├── mbm_local_bytes         > As above, but for node 1.
    └── mbm_total_bytes        /
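
A monitoring tool could walk this layout roughly as follows. This is a minimal
sketch, not part of the patch: read_counter and read_l3_domain are hypothetical
helper names, and any file content that is not a plain integer is treated as
missing (None) rather than interpreted.

```python
import os

EVENTS = ("llc_occupancy", "mbm_total_bytes", "mbm_local_bytes")


def read_counter(path):
    """Return the counter as an int, or None if the file has no numeric value."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def read_l3_domain(l3_dir):
    """Read the legacy L3-level files and the per-SNC-node files of one mon_L3_XX dir."""
    result = {"total": {}, "nodes": {}}
    # Top-level files: sums across all SNC nodes sharing this L3 cache.
    for event in EVENTS:
        result["total"][event] = read_counter(os.path.join(l3_dir, event))
    # mon_sub_L3_YY directories: finer-grained data for each SNC node.
    for entry in sorted(os.listdir(l3_dir)):
        sub = os.path.join(l3_dir, entry)
        if entry.startswith("mon_sub_L3_") and os.path.isdir(sub):
            result["nodes"][entry] = {
                event: read_counter(os.path.join(sub, event)) for event in EVENTS
            }
    return result
```

On an SNC system this would be called as
read_l3_domain("/sys/fs/resctrl/mon_data/mon_L3_00"). Each file is read
separately, so on a live system the per-node values summed by hand may differ
slightly from the top-level value read a moment earlier.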

Note for Intel RDT control features:
Memory bandwidth allocation is still performed at the L3 cache level, i.e.
throttling controls are applied to all SNC nodes.

L3 cache allocation bitmaps also apply to all SNC nodes. But note that the
amount of L3 cache represented by each bit is divided by the number of SNC
nodes per L3 cache. E.g. with a 100MB cache on a system with 10-bit
allocation masks each bit normally represents 10MB. With SNC mode enabled
with two SNC nodes per L3 cache, each bit only represents 5MB.
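
The arithmetic above can be checked with a small sketch. bytes_per_mask_bit is
a hypothetical helper, and binary megabytes are used here only so that the
division comes out exact.

```python
MIB = 1024 * 1024  # binary megabyte, chosen so the example divides evenly


def bytes_per_mask_bit(l3_size_bytes, mask_bits, snc_nodes_per_l3=1):
    """Cache capacity represented by one bit of the L3 allocation mask."""
    if mask_bits <= 0 or snc_nodes_per_l3 <= 0:
        raise ValueError("mask_bits and snc_nodes_per_l3 must be positive")
    # Each bit covers an equal share of the cache; with SNC enabled that
    # share is further divided by the number of SNC nodes per L3 cache.
    return l3_size_bytes // mask_bits // snc_nodes_per_l3


without_snc = bytes_per_mask_bit(100 * MIB, 10)                   # SNC disabled
with_snc = bytes_per_mask_bit(100 * MIB, 10, snc_nodes_per_l3=2)  # two SNC nodes
```

Here without_snc equals 10 * MIB and with_snc equals 5 * MIB, matching the
10MB and 5MB figures in the example.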