@zhang-rui zhang-rui commented Sep 24, 2025

Add EDAC basic support and RRL enhancement for CWF/SRF/GNR/GNR-D

Note: this PR also includes all the patches in #60, because it depends on the CWF CPUID support in that PR.

Tests:
Tested on GNR: error injection works as expected, and machine check exceptions are dumped with the extra RRL information.
Tested on CWF: EDAC and error injection now work, and machine check exceptions are dumped together with RRL information.

aegl and others added 5 commits August 27, 2025 22:33
commit 090e3be upstream.

Server product based on the Atom Darkmont core.

Intel-SIG: commit 090e3be x86/cpu: Add model number for Intel Clearwater Forest processor.
BACKPORTING NEW CPU IFM

Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Quanxian Wang: amend commit log ]
Signed-off-by: Quanxian Wang <[email protected]>
commit 8a8a9c9 upstream.

This one is the regular laptop CPU.

Intel-SIG: commit 8a8a9c9 x86/cpu: Add model number for another Intel Arrow Lake mobile processor.
BACKPORTING NEW CPU IFM

Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Quanxian Wang: amend commit log ]
Signed-off-by: Quanxian Wang <[email protected]>
commit a9d0adc upstream.

Refactor struct cpuinfo_x86 so that the vendor, family, and model
fields are overlaid in a union with a 32-bit field that combines
all three (together with a one byte reserved field in the upper
byte).

This will make it easy, cheap, and reliable to check all three
values at once.

See

  https://lore.kernel.org/r/Zgr6kT8oULbnmEXx@agluck-desk3

for why the ordering is (low-to-high bits):

  (vendor, family, model)

  [ bp: Move comments over the line, add the backstory about the
    particular order of the fields. ]

Intel-SIG: commit a9d0adc x86/cpu/vfm: Add/initialize x86_vfm field to struct cpuinfo_x86.
BACKPORTING NEW CPU IFM

Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Quanxian Wang: amend commit log ]
Signed-off-by: Quanxian Wang <[email protected]>
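The x86_vfm overlay described in the commit above can be sketched in userspace C. This is a minimal illustration, not the kernel's struct cpuinfo_x86: the names cpu_id, make_vfm and cpu_matches are hypothetical, and the byte layout shown assumes a little-endian machine so that vendor lands in the lowest byte.

```c
#include <stdint.h>

/* Hypothetical userspace mirror of the overlay; not the kernel struct. */
struct cpu_id {
	union {
		uint32_t vfm;             /* vendor, family and model read at once */
		struct {
			uint8_t vendor;   /* bits 0-7 on little-endian */
			uint8_t family;   /* bits 8-15 */
			uint8_t model;    /* bits 16-23 */
			uint8_t reserved; /* bits 24-31, kept zero */
		};
	};
};

/* Build the combined value in the same low-to-high order. */
static inline uint32_t make_vfm(uint8_t vendor, uint8_t family, uint8_t model)
{
	return (uint32_t)vendor | ((uint32_t)family << 8) | ((uint32_t)model << 16);
}

/* One 32-bit compare replaces three separate field compares. */
static inline int cpu_matches(const struct cpu_id *c, uint32_t vfm)
{
	return c->vfm == vfm;
}
```

With this layout, a single comparison against a precomputed constant checks all three fields, which is the "easy, cheap, and reliable" property the commit log refers to.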
commit f055b62 upstream.

New CPU #defines encode vendor and family as well as model.

Update the example usage comment in arch/x86/kernel/cpu/match.c

Intel-SIG: commit f055b62 x86/cpu/vfm: Update arch/x86/include/asm/intel-family.h.
BACKPORTING NEW CPU IFM

Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Quanxian Wang: amend commit log ]
Signed-off-by: Quanxian Wang <[email protected]>
commit 8043832 upstream.

Introduce numa_valid_node(nid) that verifies that nid is a valid node ID
and use that instead of comparing nid parameter with either NUMA_NO_NODE
or MAX_NUMNODES.

This makes the checks for valid node IDs consistent and more robust and
allows getting rid of multiple WARNings.

Intel-SIG: commit 8043832 memblock: use numa_valid_node() helper to check for invalid node ID
Add EDAC basic support and RRL enhancement for CWF/SRF/GNR/GNR-D

Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Mike Rapoport (IBM) <[email protected]>
[ Zhang Rui: amend commit log ]
Signed-off-by: Zhang Rui <[email protected]>
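A helper matching the description of numa_valid_node() above might look like the sketch below. The MAX_NUMNODES value of 64 is an assumption chosen for illustration (in the kernel it depends on the configuration), and the body is inferred from the commit log rather than copied from the patch.

```c
#include <stdbool.h>

#define NUMA_NO_NODE (-1)
#define MAX_NUMNODES 64 /* assumption: illustrative value only */

/* A node ID is valid when it lies in [0, MAX_NUMNODES). */
static inline bool numa_valid_node(int nid)
{
	return nid >= 0 && nid < MAX_NUMNODES;
}
```

A single range check of this kind rejects both NUMA_NO_NODE and MAX_NUMNODES, so callers no longer need to compare against each sentinel separately.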
RichardWeiYang and others added 18 commits November 7, 2025 14:44
commit 9364a7e upstream.

Commit 8043832 ("memblock: use numa_valid_node() helper to check
for invalid node ID") introduced a new helper, numa_valid_node(), which is
not defined in the memblock tests.

Let's add it in the corresponding header file.

Intel-SIG: commit 9364a7e memblock tests: fix implicit declaration of function 'numa_valid_node'
Add EDAC basic support and RRL enhancement for CWF/SRF/GNR/GNR-D

Signed-off-by: Wei Yang <[email protected]>
CC: Mike Rapoport (IBM) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
[ Zhang Rui: amend commit log ]
Signed-off-by: Zhang Rui <[email protected]>
commit 2397f795735219caa9c2fe61e7bcdd0652e670d3 upstream.

The current skx_common determines whether the memory error source is the
near memory of the 2LM system and then retrieves the decoded error results
from the ADXL components (near-memory vs. far-memory) accordingly.

However, some memory controllers may have limitations in correctly
reporting the memory error source, leading to the retrieval of incorrect
decoded parts from the ADXL.

To address these limitations, instead of simply determining whether the
memory error is from the near memory of the 2LM system, it is necessary to
distinguish the memory error source details as follows:

  Memory error from the near memory of the 2LM system.
  Memory error from the far memory of the 2LM system.
  Memory error from the 1LM system.
  Not a memory error.

This will enable the i10nm_edac driver to take appropriate actions for
those memory controllers that have limitations in reporting the memory
error source.

Intel-SIG: commit 2397f7957352 EDAC/skx_common: Differentiate memory error sources
Add EDAC basic support and RRL enhancement for CWF/SRF/GNR/GNR-D

Fixes: ba987ea ("EDAC/i10nm: Add Intel Granite Rapids server support")
Signed-off-by: Qiuxu Zhuo <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Tested-by: Diego Garcia Rodriguez <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Zhang Rui: amend commit log ]
Signed-off-by: Zhang Rui <[email protected]>
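The four error sources listed in the commit above map naturally onto an enumeration. The sketch below uses hypothetical names, and the rule that only the 2LM near-memory case reads the near-memory ADXL components is an assumption drawn from the commit log, not from the patch itself.

```c
#include <stdbool.h>

/* Hypothetical names; the four cases come from the commit message. */
enum error_source {
	ERR_SRC_1LM,        /* memory error from a 1LM system */
	ERR_SRC_2LM_NM,     /* near memory of a 2LM system */
	ERR_SRC_2LM_FM,     /* far memory of a 2LM system */
	ERR_SRC_NOT_MEMORY, /* not a memory error */
};

/* Anything except ERR_SRC_NOT_MEMORY is worth decoding. */
static inline bool is_memory_error(enum error_source src)
{
	return src != ERR_SRC_NOT_MEMORY;
}

/* Assumption: only 2LM near-memory errors use the near-memory components. */
static inline bool use_near_memory_components(enum error_source src)
{
	return src == ERR_SRC_2LM_NM;
}
```

Compared with a single near-memory yes/no flag, the four-way value lets a driver special-case one source (for example 2LM far memory) without affecting the others.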
commit a36667037a0c0e36c59407f8ae636295390239a5 upstream.

The Granite Rapids CPUs with Flat2LM memory configurations may
mistakenly report near-memory errors as far-memory errors, resulting
in invalid decoded ADXL results:

  EDAC skx: Bad imc -1

Fix this incorrect far-memory error source indicator by prefetching the
decoded far-memory controller ID, and adjust the error source indicator
to near-memory if the far-memory controller ID is invalid.

Intel-SIG: commit a36667037a0c EDAC/{skx_common,i10nm}: Fix incorrect far-memory error source indicator
Add EDAC basic support and RRL enhancement for CWF/SRF/GNR/GNR-D

Fixes: ba987ea ("EDAC/i10nm: Add Intel Granite Rapids server support")
Signed-off-by: Qiuxu Zhuo <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Tested-by: Diego Garcia Rodriguez <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Zhang Rui: amend commit log ]
Signed-off-by: Zhang Rui <[email protected]>
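The fix-up described in the commit above can be sketched as a small pure function. The names are hypothetical, and treating any negative controller ID as invalid is an assumption based on the "Bad imc -1" message quoted in the commit log.

```c
/* Hypothetical names mirroring the four error sources. */
enum error_source {
	ERR_SRC_1LM,
	ERR_SRC_2LM_NM,
	ERR_SRC_2LM_FM,
	ERR_SRC_NOT_MEMORY,
};

/*
 * If the hardware reported a far-memory error but the prefetched
 * far-memory controller ID is invalid, treat it as a near-memory error.
 * Assumption: a negative ID means "invalid".
 */
static inline enum error_source fixup_error_source(enum error_source src,
						   int fm_imc)
{
	if (src == ERR_SRC_2LM_FM && fm_imc < 0)
		return ERR_SRC_2LM_NM;
	return src;
}
```

All other sources pass through unchanged, so the adjustment only affects the misreported case.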
commit e77086c upstream.

The Grand Ridge CPU model uses memory controller registers similar to
those of the Granite Rapids server. Add Grand Ridge CPU model ID for EDAC support.

Intel-SIG: commit e77086c EDAC/i10nm: Add Intel Grand Ridge micro-server support
Add EDAC basic support and RRL enhancement for CWF/SRF/GNR/GNR-D

Tested-by: Ricardo Neri <[email protected]>
Signed-off-by: Qiuxu Zhuo <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Zhang Rui: amend commit log ]
Signed-off-by: Zhang Rui <[email protected]>
commit d9207cf7760f5f5599e9ff7eb0fedf56821a1d59 upstream.

When doing error injection to some memory DIMMs on certain Intel Emerald
Rapids servers, the i10nm_edac missed error reports for some memory DIMMs.

Certain BIOS configurations may hide some memory controllers, and the
i10nm_edac doesn't enumerate these hidden memory controllers. However, the
ADXL decodes memory errors using memory controller physical indices even
if there are hidden memory controllers. Therefore, the memory controller
physical indices reported by the ADXL may mismatch the logical indices
enumerated by the i10nm_edac, resulting in missed error reports for some
memory DIMMs.

Fix this issue by creating a mapping table from memory controller physical
indices (used by the ADXL) to logical indices (used by the i10nm_edac) and
using it to convert the physical indices to the logical indices during the
error handling process.

Intel-SIG: commit d9207cf7760f EDAC/{skx_common,i10nm}: Fix some missing error reports on Emerald Rapids
Add EDAC basic support and RRL enhancement for CWF/SRF/GNR/GNR-D

Fixes: c545f5e ("EDAC/i10nm: Skip the absent memory controllers")
Reported-by: Kevin Chang <[email protected]>
Tested-by: Kevin Chang <[email protected]>
Reported-by: Thomas Chen <[email protected]>
Tested-by: Thomas Chen <[email protected]>
Signed-off-by: Qiuxu Zhuo <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Zhang Rui: amend commit log ]
Signed-off-by: Zhang Rui <[email protected]>
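The physical-to-logical mapping table from the commit above can be illustrated as follows. This is a sketch under stated assumptions: NUM_IMC, IMC_ABSENT, struct imc_map and both functions are hypothetical, and the controller count of 4 is arbitrary.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_IMC    4    /* assumption: illustrative controller count */
#define IMC_ABSENT 0xff /* hypothetical marker for a hidden controller */

struct imc_map {
	uint8_t phys_to_logical[NUM_IMC];
};

/* Visible controllers get consecutive logical indices; hidden ones get none. */
static inline void imc_map_build(struct imc_map *m, const bool present[NUM_IMC])
{
	uint8_t logical = 0;

	for (int phys = 0; phys < NUM_IMC; phys++)
		m->phys_to_logical[phys] = present[phys] ? logical++ : IMC_ABSENT;
}

/* Convert a physical index (as a decoder reports it) to a logical one, or -1. */
static inline int imc_phys_to_logical(const struct imc_map *m, int phys)
{
	if (phys < 0 || phys >= NUM_IMC ||
	    m->phys_to_logical[phys] == IMC_ABSENT)
		return -1;
	return m->phys_to_logical[phys];
}
```

For example, if controller 1 is hidden, physical index 2 maps to logical index 1; using the physical index directly would point at the wrong controller, which is the kind of mismatch the commit log describes.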
commit eeed3e03f4261e5e381a72ae099ff00ccafbb437 upstream.

When enabling the retry_rd_err_log (RRL) feature during the loading of the
i10nm_edac driver with the module parameter retry_rd_err_log=2 (Linux RRL
control mode), the default values of the control bits of RRL are saved so
that they can be restored during the unloading of the driver.

In the current code, the RRL of pseudo channel 1 of HBM overwrites pseudo
channel 0 during the loading of the driver, resulting in the loss of saved
RRL for pseudo channel 0. This causes the RRL of pseudo channel 0 of HBM to
be wrongly restored with the values from pseudo channel 1 when unloading
the driver.

Fix this issue by creating two separate groups of RRL control registers
per channel to save default RRL settings of two {sub-,pseudo-}channels.

Intel-SIG: commit eeed3e03f426 EDAC/{skx_common,i10nm}: Fix the loss of saved RRL for HBM pseudo channel 0
Add EDAC basic support and RRL enhancement for CWF/SRF/GNR/GNR-D

Fixes: acd4cf6 ("EDAC/i10nm: Retrieve and print retry_rd_err_log registers for HBM")
Signed-off-by: Qiuxu Zhuo <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Tested-by: Feng Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Zhang Rui: amend commit log ]
Signed-off-by: Zhang Rui <[email protected]>
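The save/restore fix in the commit above amounts to giving each {sub-,pseudo-}channel its own storage. A minimal sketch, with hypothetical names and an arbitrary register count per group:

```c
#include <stdint.h>

#define NUM_SUBCH    2 /* two {sub-,pseudo-}channels per channel */
#define NUM_RRL_REGS 2 /* assumption: illustrative register count per group */

/* One saved group of control values per sub-/pseudo-channel. */
struct chan_rrl_saved {
	uint32_t ctl[NUM_SUBCH][NUM_RRL_REGS];
};

/* Save a default control value for one register of one sub-channel. */
static inline void rrl_save(struct chan_rrl_saved *s, int sub, int reg,
			    uint32_t val)
{
	s->ctl[sub][reg] = val;
}

/* Fetch the value to write back when the driver unloads. */
static inline uint32_t rrl_saved(const struct chan_rrl_saved *s, int sub,
				 int reg)
{
	return s->ctl[sub][reg];
}
```

With a single shared slot, the second save would overwrite the first; indexing by sub-channel keeps the pseudo channel 0 defaults intact.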
commit 8b93582 upstream.

Commit

  afdb82fd763c ("EDAC, i10nm: make skx_common.o a separate module")

made skx_common.o a separate module. With skx_common.o now a separate
module, move the common debug code setup_{skx,i10nm}_debug() and
teardown_{skx,i10nm}_debug() in {skx,i10nm}_base.c to skx_common.c to
reduce code duplication. Additionally, prefix these function names with
'skx' to maintain consistency with other names in the file.

Intel-SIG: commit 8b93582 EDAC/{skx_common,skx,i10nm}: Move the common debug code to skx_common
Add EDAC basic support and RRL enhancement for CWF/SRF/GNR/GNR-D

Signed-off-by: Qiuxu Zhuo <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
[ Zhang Rui: amend commit log ]
Signed-off-by: Zhang Rui <[email protected]>
commit 7a33c14 upstream.

The configuration flag 'res_config->support_ddr5 = true' sufficiently
indicates DDR5 memory support for Sapphire Rapids and Granite Rapids.
Additionally, the i10nm_edac driver doesn't need to use the AMAP
register for setting the 'fine_grain_bank' of each DIMM. Therefore,
remove the AMAP register for determining DDR5.

Intel-SIG: commit 7a33c14 EDAC/{skx_common,i10nm}: Remove the AMAP register for determing DDR5
Add EDAC basic support and RRL enhancement for CWF/SRF/GNR/GNR-D

Signed-off-by: Qiuxu Zhuo <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
[ Zhang Rui: amend commit log ]
Signed-off-by: Zhang Rui <[email protected]>
commit 2e55bb9b71e179c37d05deff37daa0dd8d04b59d upstream.

Clearwater Forest is the successor to Sierra Forest. Add Clearwater
Forest CPU model ID for EDAC support.

Intel-SIG: commit 2e55bb9b71e1 EDAC/i10nm: Add Intel Clearwater Forest server support
Add EDAC basic support and RRL enhancement for CWF/SRF/GNR/GNR-D

Signed-off-by: Qiuxu Zhuo <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Tested-by: Yi Lai <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Zhang Rui: resolve conflict (use old X86 Macro) and amend commit log ]
Signed-off-by: Zhang Rui <[email protected]>
commit 584e09743d2f44905290b0dbf3215064d2a1888c upstream.

The 3-bit source IDs in PCI configuration space registers, used to map
devices to sockets, are limited to 8 unique IDs, and each ID is local to
a UPI/QPI domain.

Source IDs cannot be used to map devices to sockets on UV systems
because they can exceed 8 sockets and have multiple UPI/QPI domains with
identical, repeating source IDs.

Use NUMA information to get package IDs instead of source IDs on UV
systems, and use package/source IDs to name IMC information structures.

Intel-SIG: commit 584e09743d2f EDAC/{i10nm,skx,skx_common}: Support UV systems
Add EDAC basic support and RRL enhancement for CWF/SRF/GNR/GNR-D

Signed-off-by: Kyle Meyer <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Reviewed-by: Qiuxu Zhuo <[email protected]>
Link: https://lore.kernel.org/all/[email protected]/
[ Zhang Rui: resolve conflict (use topology_physical_package_id) and amend commit log ]
Signed-off-by: Zhang Rui <[email protected]>
commit 4878e1e90056230cefd580136d0e6d5689a7b770 upstream.

The i10nm_edac driver uses the default modes (either patrol scrub read
or on-demand read) of the RRL register sets configured by the BIOS.

Explicitly set the modes during the loading of the i10nm_edac driver with
the module parameter retry_rd_err_log=2.

Intel-SIG: commit 4878e1e90056 EDAC/i10nm: Explicitly set the modes of the RRL register sets
Add EDAC basic support and RRL enhancement for CWF/SRF/GNR/GNR-D

Signed-off-by: Qiuxu Zhuo <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Tested-by: Feng Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Zhang Rui: amend commit log ]
Signed-off-by: Zhang Rui <[email protected]>
commit 1a8a6af663a7f16c9b2779cf728187775735047b upstream.

As the number of RRL (retry_rd_err_log) registers per memory channel
increases, the positions of the RRL control bits and the widths of the
RRL registers vary across different CPU generations. Adding RRL support
for a new CPU requires handling these differences throughout the
RRL-related code.

Structure the offsets, widths, control bit positions, set numbers, modes,
etc., of the per-channel RRL registers and make them configurable to
facilitate easier RRL support for new CPUs.

No functional changes are intended.

Intel-SIG: commit 1a8a6af663a7 EDAC/{skx_common,i10nm}: Structure the per-channel RRL registers
Add EDAC basic support and RRL enhancement for CWF/SRF/GNR/GNR-D

Signed-off-by: Qiuxu Zhuo <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Tested-by: Feng Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Zhang Rui: amend commit log ]
Signed-off-by: Zhang Rui <[email protected]>
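The idea of describing the RRL control bits as data, as in the commit above, can be sketched like this. Everything here is hypothetical: the struct name, the field set, and especially the bit positions and set counts, which are made up for illustration. Only the control bit names en, noover and en_patspr come from the commit logs in this series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-generation description of the RRL control bits. */
struct rrl_ctl_layout {
	uint32_t en;        /* mask of the 'en' control bit */
	uint32_t noover;    /* mask of the 'noover' control bit */
	uint32_t en_patspr; /* mask of the 'en_patspr' control bit */
	int set_num;        /* number of RRL register sets per channel */
};

/* Bit positions and set counts are invented for this example. */
static const struct rrl_ctl_layout layout_a = {
	.en = 1u << 0, .noover = 1u << 1, .en_patspr = 1u << 2, .set_num = 2,
};
static const struct rrl_ctl_layout layout_b = {
	.en = 1u << 12, .noover = 1u << 13, .en_patspr = 1u << 14, .set_num = 3,
};

/* Set or clear each control bit according to the layout, not hard-coded masks. */
static inline uint32_t rrl_ctl_apply(const struct rrl_ctl_layout *l,
				     uint32_t val, bool en, bool noover,
				     bool en_patspr)
{
	val = en ? (val | l->en) : (val & ~l->en);
	val = noover ? (val | l->noover) : (val & ~l->noover);
	val = en_patspr ? (val | l->en_patspr) : (val & ~l->en_patspr);
	return val;
}
```

Supporting a new CPU generation then means adding another layout table rather than touching the code that programs the registers.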
commit ba3985c1faf5eb72084ddc31204b076c2a450263 upstream.

Refactor enable_retry_rd_err_log() using helper functions for both
DDR and HBM, making the RRL control bits configurable instead of
hard-coded. Additionally, explicitly define the four RRL modes for
better readability.

No functional changes intended.

Intel-SIG: commit ba3985c1faf5 EDAC/{skx_common,i10nm}: Refactor enable_retry_rd_err_log()
Add EDAC basic support and RRL enhancement for CWF/SRF/GNR/GNR-D

Signed-off-by: Qiuxu Zhuo <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Tested-by: Feng Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Zhang Rui: amend commit log ]
Signed-off-by: Zhang Rui <[email protected]>
commit 126168fa2c3e16113ea75a656fff5156a54a5726 upstream.

Make the {valid bit, overwritten status, number} of RRL registers and the
{number, offsets, widths} of per-channel CORRERRCNT registers configurable.
Refactor show_retry_rd_err_log() to use the configurable fields of struct
reg_rrl, making the code more scalable and simpler.

No functional changes intended.

Intel-SIG: commit 126168fa2c3e EDAC/{skx_common,i10nm}: Refactor show_retry_rd_err_log()
Add EDAC basic support and RRL enhancement for CWF/SRF/GNR/GNR-D

Signed-off-by: Qiuxu Zhuo <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Tested-by: Feng Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Zhang Rui: amend commit log ]
Signed-off-by: Zhang Rui <[email protected]>
commit 5904dc561ef21e69f0b9dca39d1a66e34b7ea764 upstream.

Compared to previous generations, Granite Rapids defines the RRL control
bits {en_patspr, noover, en} in different positions, adds an extra RRL set
for the new mode of the first patrol-scrub read error, and extends the
number of CORRERRCNT registers from 4 to 8, encoding one counter per
CORRERRCNT register.

Add a Granite Rapids reg_rrl configuration table and adjust the code to
accommodate the differences mentioned above for RRL support.

Intel-SIG: commit 5904dc561ef2 EDAC/{skx_common,i10nm}: Add RRL support for Intel Granite Rapids server
Add EDAC basic support and RRL enhancement for CWF/SRF/GNR/GNR-D

Signed-off-by: Qiuxu Zhuo <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Tested-by: Feng Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Zhang Rui: amend commit log ]
Signed-off-by: Zhang Rui <[email protected]>
commit 2b2408aca90b86c1ef51c19d834e5f6db0a1ff30 upstream.

The Smatch static checker reported the following warning:

  drivers/edac/i10nm_base.c:364 show_retry_rd_err_log()
  warn: should bitwise negate be 'ullong'?

This warning was due to the bitwise NOT/AND operations between
'status_mask' (a u32 type) and 'log' (a u64 type), which resulted in
the high 32 bits of 'log' being cleared.

This was a false positive warning, as only the low 32 bits of 'log' were
written to the first RRL memory controller register (a u32 type).

To improve code sanity, fix this warning by changing 'status_mask' to
a u64 type, ensuring it matches the size of 'log' for bitwise operations.

Intel-SIG: commit 2b2408aca90b EDAC/i10nm: Fix the bitwise operation between variables of different sizes
Add EDAC basic support and RRL enhancement for CWF/SRF/GNR/GNR-D

Reported-by: Dan Carpenter <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Qiuxu Zhuo <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Zhang Rui: amend commit log ]
Signed-off-by: Zhang Rui <[email protected]>
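The width mismatch behind the Smatch warning above is easy to reproduce in plain C. In this sketch (function names are hypothetical), negating a 32-bit mask yields a 32-bit result that is zero-extended before the AND, so the upper half of the 64-bit value is cleared as a side effect. It assumes the usual case where unsigned int is 32 bits wide.

```c
#include <stdint.h>

/* ~mask is computed in 32 bits, then zero-extended: high 32 bits of log are lost. */
static inline uint64_t clear_bits_u32_mask(uint64_t log, uint32_t mask)
{
	return log & ~mask;
}

/* With a 64-bit mask, ~mask keeps the high 32 bits set, so they survive. */
static inline uint64_t clear_bits_u64_mask(uint64_t log, uint64_t mask)
{
	return log & ~mask;
}
```

For log = 0x1234567800000003 and mask = 1, the first function returns 0x2 while the second returns 0x1234567800000002. Widening the mask to match 'log', as the commit does, removes the surprise even where the high bits are not used afterwards.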
commit 9ad08c1115646533097c8a799ad046bf5127b04a upstream.

The Granite Rapids-D CPU model uses memory controller registers similar
to those of the Granite Rapids server CPU but with a different memory
controller MMIO base.

Add the Granite Rapids-D CPU model ID and use the new memory controller
MMIO base for EDAC support.

Intel-SIG: commit 9ad08c111564 EDAC/i10nm: Add Intel Granite Rapids-D support
Add EDAC basic support and RRL enhancement for CWF/SRF/GNR/GNR-D

Signed-off-by: Qiuxu Zhuo <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Tested-by: VikasX Chougule <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Zhang Rui: resolve conflict (use old X86 Macro) and amend commit log ]
Signed-off-by: Zhang Rui <[email protected]>
commit 35928bc38db69a2af26624e35a250c1e0f9a6a3f upstream.

snprintf() is fragile when its return value will be used to append
additional data to a buffer. Use scnprintf() instead.

Intel-SIG: commit 35928bc38db6 EDAC/{skx_common,i10nm}: Use scnprintf() for safer buffer handling
Add EDAC basic support and RRL enhancement for CWF/SRF/GNR/GNR-D

Signed-off-by: Wang Haoran <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Reviewed-by: Qiuxu Zhuo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Zhang Rui: amend commit log ]
Signed-off-by: Zhang Rui <[email protected]>
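The difference between the two return conventions mentioned in the commit above can be shown in userspace. scnprintf() itself is a kernel function, so the sketch below defines a hypothetical stand-in, sketch_scnprintf(), on top of the standard vsnprintf(); it returns the number of characters actually stored (excluding the terminating NUL) rather than the number that would have been written.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical userspace stand-in that follows the scnprintf() convention. */
static int sketch_scnprintf(char *buf, size_t size, const char *fmt, ...)
{
	va_list ap;
	int n;

	if (size == 0)
		return 0;

	va_start(ap, fmt);
	n = vsnprintf(buf, size, fmt, ap);
	va_end(ap);

	if (n < 0)
		return 0;
	/* On truncation, report what fit in the buffer, not what was wanted. */
	return (size_t)n >= size ? (int)(size - 1) : n;
}
```

When the return value is added to a running offset to append more data, the snprintf() convention can push the offset past the end of the buffer after a truncation, whereas the convention sketched here keeps it within bounds.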