@bhe4 bhe4 commented Oct 29, 2025

This PR combines the below PRs for 6.6 kernel testing purposes:
#70 opened on Sep 24 by zhang-rui

#69 opened on Sep 24 by zhang-rui

#68 opened on Sep 24 by zhang-rui

#67 opened on Sep 24 by zhang-rui

#66 opened on Sep 24 by zhang-rui

#65 opened on Sep 24 by zhang-rui

#64 opened on Sep 24 by x56Jason

#62 opened on Sep 18 by x56Jason

#60 opened on Aug 19 by quanxianwang

#59 opened on Aug 7 by quanxianwang

#49 opened on Jun 10 by jiayingbao

#48 opened on Jun 10 by jiayingbao

#45 opened on Mar 31 by zhiquan1-li

#43 opened on Mar 27 by AichunShi

#41 opened on Mar 27 by x56Jason

#39 opened on Mar 27 by EthanZHF

#36 opened on Mar 27 by quanxianwang

#31 opened on Mar 11 by aubreyli

kees and others added 30 commits October 29, 2025 19:25
commit 2e89345 upstream.

Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run time via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).

As found with Coccinelle[1], add __counted_by for struct prm_module_info.

Intel-SIG: commit 2e89345 ACPI: PRM: Annotate struct prm_module_info with __counted_by.
Backport PRM update and bugfixes up to v6.14.

Link: https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci # [1]
Signed-off-by: Kees Cook <[email protected]>
Reviewed-by: Gustavo A. R. Silva <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
[ Aubrey Li: amend commit log ]
Signed-off-by: Aubrey Li <[email protected]>
commit f0fcdd2 upstream.

Platform Runtime Mechanism (PRM) handlers can be invoked from either the AML
interpreter or directly by an OS driver. Implement the latter.

  [ bp: Massage commit message. ]

Intel-SIG: commit f0fcdd2 PRM: Add PRM handler direct call support.
Backport PRM update and bugfixes up to v6.14.

Signed-off-by: John Allen <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Yazen Ghannam <[email protected]>
Reviewed-by: Ard Biesheuvel <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Aubrey Li: amend commit log ]
Signed-off-by: Aubrey Li <[email protected]>
commit 090e3be upstream.

Server product based on the Atom Darkmont core.

Intel-SIG: commit 090e3be x86/cpu: Add model number for Intel Clearwater Forest processor.

Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Quanxian Wang: amend commit log ]
Signed-off-by: Quanxian Wang <[email protected]>
commit a0423af92cb31e6fc4f53ef9b6e19fdf08ad4395 upstream.

The latest Intel platform, Clearwater Forest, has introduced new
instructions enumerated by the CPUIDs of SHA512, SM3, SM4 and
AVX-VNNI-INT16. Advertise these CPUIDs to userspace so that guests can
query them directly.

SHA512, SM3 and SM4 sit on an expected-dense CPUID leaf, and some other
bits on this leaf have kernel usages. Since these three have no real
kernel usage, hide them in /proc/cpuinfo.

These new instructions only operate in xmm, ymm registers and have no new
VMX controls, so there is no additional host enabling required for guests
to use these instructions, i.e. advertising these CPUIDs to userspace is
safe.

Intel-SIG: commit a0423af92cb3 x86: KVM: Advertise CPUIDs for new instructions in Clearwater Forest.

Tested-by: Jiaan Lu <[email protected]>
Tested-by: Xuelian Guo <[email protected]>
Signed-off-by: Tao Su <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
[ Quanxian Wang: amend commit log ]
Signed-off-by: Quanxian Wang <[email protected]>
commit f91f2a9 upstream.

A new DSA device ID, 0x11fb, is introduced for the Granite Rapids-D
platform. Add the device ID to the IDXD driver.

Since a potential security issue has been fixed on the new device, it is
safe to assign the device to virtual machines, and therefore the new
device ID will not be added to the VFIO denylist. Additionally, the new
device ID may be useful in identifying and addressing any other potential
issues with this specific device in the future. The same applies to any
other new DSA/IAA devices with new device IDs.

Intel-SIG: commit f91f2a9 dmaengine: idxd: Add a new DSA device ID
for Granite Rapids-D platform
Add GNR new idxd id support.

Signed-off-by: Fenghua Yu <[email protected]>
Reviewed-by: Dave Jiang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Vinod Koul <[email protected]>
(cherry picked from commit f91f2a9)
Signed-off-by: Ethan Zhao <[email protected]>
commit b628cb5 upstream.

Use the GuestPhysBits field in CPUID.0x80000008 to communicate the max
mappable GPA to userspace, i.e. the max GPA that is addressable by the
CPU itself.  Typically this is identical to the max effective GPA, except
in the case where the CPU supports MAXPHYADDR > 48 but does not support
5-level TDP (the CPU consults bits 51:48 of the GPA only when walking the
fifth level TDP page table entry).

Enumerating the max mappable GPA via CPUID will allow guest firmware to
map resources like PCI bars in the highest possible address space, while
ensuring that the GPA is addressable by the CPU.  Without precise
knowledge about the max mappable GPA, the guest must assume that 5-level
paging is unsupported and thus restrict its mappings to the lower 48 bits.

Advertise the max mappable GPA via KVM_GET_SUPPORTED_CPUID as userspace
doesn't have easy access to whether or not 5-level paging is supported,
and to play nice with userspace VMMs that reflect the supported CPUID
directly into the guest.

AMD's APM (3.35) defines GuestPhysBits (EAX[23:16]) as:

  Maximum guest physical address size in bits.  This number applies
  only to guests using nested paging.  When this field is zero, refer
  to the PhysAddrSize field for the maximum guest physical address size.

Tom Lendacky confirmed that the purpose of GuestPhysBits is software use
and KVM can use it as described above.  Real hardware always returns zero.

Leave GuestPhysBits as '0' when TDP is disabled in order to comply with
the APM's statement that GuestPhysBits "applies only to guest using nested
paging".  As above, guest firmware will likely create suboptimal mappings,
but that is a very minor issue and not a functional concern.

Intel-SIG: commit b628cb5 KVM: x86: Advertise max mappable GPA in CPUID.0x80000008.GuestPhysBits

Signed-off-by: Gerd Hoffmann <[email protected]>
Reviewed-by: Xiaoyao Li <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[sean: massage changelog]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Jason Zeng <[email protected]>
commit 980b8bc upstream.

Use the max mappable GPA via GuestPhysBits advertised by KVM to calculate
max_gfn. Currently some selftests (e.g. access_tracking_perf_test,
dirty_log_test...) add RAM regions close to max_gfn, so guest may access
GPA beyond its mappable range and cause infinite loop.

Adjust max_gfn in vm_compute_max_gfn() since x86 selftests already
overrides vm_compute_max_gfn() specifically to deal with goofy edge cases.

Intel-SIG: commit 980b8bc KVM: selftests: x86: Prioritize getting max_gfn from GuestPhysBits

Reported-by: Yi Lai <[email protected]>
Signed-off-by: Tao Su <[email protected]>
Tested-by: Yi Lai <[email protected]>
Reviewed-by: Xiaoyao Li <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[sean: tweak name, add comment and sanity check]
Signed-off-by: Sean Christopherson <[email protected]>

 Conflicts:
	tools/testing/selftests/kvm/include/x86_64/processor.h
[jz: resolve simple context conflict]
Signed-off-by: Jason Zeng <[email protected]>
commit 8b93582 upstream.

Commit

  afdb82fd763c ("EDAC, i10nm: make skx_common.o a separate module")

made skx_common.o a separate module. With skx_common.o now a separate
module, move the common debug code setup_{skx,i10nm}_debug() and
teardown_{skx,i10nm}_debug() in {skx,i10nm}_base.c to skx_common.c to
reduce code duplication. Additionally, prefix these function names with
'skx' to maintain consistency with other names in the file.

Intel-SIG: commit 8b93582 EDAC/{skx_common,skx,i10nm}: Move the common debug code to skx_common
Backport to fix EDAC driver for GNR

Signed-off-by: Qiuxu Zhuo <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
[ Aichun Shi: amend commit log ]
Signed-off-by: Aichun Shi <[email protected]>
commit 7efb4d8 upstream.

When SGX EDECCSSA support was added to KVM in commit 16a7fe3
("KVM/VMX: Allow exposing EDECCSSA user leaf function to KVM guest"), it
forgot to clear the X86_FEATURE_SGX_EDECCSSA bit in KVM CPU caps when
KVM SGX is disabled.  Fix it.

Intel-SIG: commit 7efb4d8 KVM: VMX: Also clear SGX EDECCSSA in KVM
CPU caps when SGX is disabled
Backport a fix for the KVM exposing the SGX EDECCSSA capability.

Fixes: 16a7fe3 ("KVM/VMX: Allow exposing EDECCSSA user leaf function to KVM guest")
Signed-off-by: Kai Huang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
[ Zhiquan Li: amend commit log ]
Signed-off-by: Zhiquan Li <[email protected]>

commit bb9a9bf upstream.

The scope of uncore control is per power domain with TPMI.

Two types of processor topologies can be presented by the CPUID extended
topology leaf, irrespective of the hardware architecture:

1. A die is not enumerated in CPUID, so only one die per package is
visible. In this case there can be multiple power domains in a single
die.
2. A power domain in a package is enumerated as a die in CPUID, so
there is one power domain per die.

To allow die level controls, the current implementation creates a root
domain and aggregates all information from power domains in it. This
is well suited for configuration 1 above.

But for configuration 2 above, the root domain would present the same
information as the power domain, so there is no point in aggregating. To
check which configuration applies, call topology_max_dies_per_package():
if it returns more than one, avoid creating the root domain.

Intel-SIG: commit bb9a9bf platform/x86/intel-uncore-freq: Do not present separate package-die domain.
Backport Intel uncore-freq driver elc support and update

Signed-off-by: Srinivas Pandruvada <[email protected]>
Reviewed-by: Ilpo Järvinen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Hans de Goede <[email protected]>
Signed-off-by: Hans de Goede <[email protected]>
[ Yingbao Jia: amend commit log ]
Signed-off-by: Yingbao Jia <[email protected]>

commit bb516dc upstream.

Add efficiency latency control support to the TPMI uncore driver. This
defines two new threshold values for controlling uncore frequency, low
threshold and high threshold. When CPU utilization is below low threshold,
the user configurable floor latency control frequency can be used by the
system. When CPU utilization is above high threshold, the uncore frequency
is increased in 100MHz steps until power limit is reached.

Intel-SIG: commit bb516dc platform/x86/intel-uncore-freq: Add support for efficiency latency control.
Backport Intel uncore-freq driver elc support and update

Signed-off-by: Tero Kristo <[email protected]>
Reviewed-by: Ilpo Järvinen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Hans de Goede <[email protected]>
[ Yingbao Jia: amend commit log ]
Signed-off-by: Yingbao Jia <[email protected]>

commit 24b6616 upstream.

Add the TPMI efficiency latency control fields to the sysfs interface.
The sysfs files are mapped to the TPMI uncore driver via the registered
uncore_read and uncore_write driver callbacks. These fields are not
populated on older non TPMI hardware.

Intel-SIG: commit 24b6616 platform/x86/intel-uncore-freq: Add efficiency latency control to sysfs interface.
Backport Intel uncore-freq driver elc support and update

Signed-off-by: Tero Kristo <[email protected]>
Reviewed-by: Ilpo Järvinen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Hans de Goede <[email protected]>
[ Yingbao Jia: amend commit log ]
Signed-off-by: Yingbao Jia <[email protected]>
commit f557e0d1c2e6eb6af6d4468ed2c0ee91829370e2 upstream.

Add Granite Rapids Xeon D C-states support: C1, C1E, C6, and C6P.

The C-states are basically the same as in Granite Rapids Xeon SP/AP, but
characteristics (latency, target residency) are a bit different.

Intel-SIG: commit f557e0d1c2e6 intel_idle: add Granite Rapids Xeon D support.
Backport Intel idle GNR-D support.

Signed-off-by: Artem Bityutskiy <[email protected]>
Link: https://patch.msgid.link/[email protected]
[ rjw: Changelog edit ]
Signed-off-by: Rafael J. Wysocki <[email protected]>
[ Yingbao Jia: amend commit log ]
Signed-off-by: Yingbao Jia <[email protected]>
commit eeed4bfbe9b96214162a09a7fbb7570fa9522ca4 upstream.

Clearwater Forest (CWF) SoC has the same C-states as Sierra Forest (SRF)
SoC.  Add CWF support by re-using the SRF C-states table.

Note: it is expected that CWF C-states will have same or very similar
characteristics as SRF C-states (latency and target residency).

However, there is a possibility that the characteristics will end up
being different enough when the CWF platform development is finished.
In that case, a separate CWF C-states table will be created and populated
with the CWF-specific characteristics (latency and target residency).

Intel-SIG: commit eeed4bfbe9b9 intel_idle: add Clearwater Forest SoC support.
Backport Intel idle CWF support.

Signed-off-by: Artem Bityutskiy <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Rafael J. Wysocki <[email protected]>
[ Yingbao Jia: amend commit log ]
Signed-off-by: Yingbao Jia <[email protected]>
commit 1c450ff upstream.

Advertise AVX10.1 related CPUIDs, i.e. report AVX10 support bit via
CPUID.(EAX=07H, ECX=01H):EDX[bit 19] and new CPUID leaf 0x24H so that
guest OS and applications can query the AVX10.1 CPUIDs directly. Intel
AVX10 represents the first major new vector ISA since the introduction of
Intel AVX512, which will establish a common, converged vector instruction
set across all Intel architectures[1].

AVX10.1 is an early version of AVX10 that enumerates the Intel AVX512
instruction set at 128, 256, and 512 bits, and is enabled on
Granite Rapids. I.e., AVX10.1 is only a new CPUID enumeration with no
new functionality. New features, e.g. Embedded Rounding and Suppress
All Exceptions (SAE), will be introduced in AVX10.2.

Advertising AVX10.1 is safe because there is nothing to enable for AVX10.1,
i.e. it's purely a new way to enumerate support, thus there will never be
anything for the kernel to enable. Note that only the CPUID checking
changes when using AVX512-related instructions: e.g., where an AVX512
instruction previously required checking (AVX512 AND AVX512DQ), it can
now check ((AVX512 AND AVX512DQ) OR AVX10.1) after checking XCR0[7:5].

The versions of AVX10 are expected to be inclusive, e.g. version N+1 is
a superset of version N. Per the spec, the version can never be 0, just
advertise AVX10.1 if it's supported in hardware. Moreover, advertising
AVX10_{128,256,512} needs to land in the same commit as advertising basic
AVX10.1 support, otherwise KVM would advertise an impossible CPU model.
E.g. a CPU with AVX512 but not AVX10.1/512 is impossible per the SDM.

As more and more AVX related CPUIDs are added (it would have resulted in
around 40-50 CPUID flags when developing AVX10), the versioning approach
is introduced. But incrementing version numbers are bad for virtualization.
E.g. if AVX10.2 has a feature that shouldn't be enumerated to guests for
whatever reason, then KVM can't enumerate any "later" features either,
because the only way to hide the problematic AVX10.2 feature is to set the
version to AVX10.1 or lower[2]. But most AVX features are just passed
through and don't have virtualization controls, so AVX10 should not be
problematic in practice, so long as Intel honors their promise that future
versions will be supersets of past versions.

[1] https://cdrdv2.intel.com/v1/dl/getContent/784267
[2] https://lore.kernel.org/all/[email protected]/

Intel-SIG: commit 1c450ff KVM: x86: Advertise AVX10.1 CPUID to userspace.
GNR AVX10.1 backporting

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Tao Su <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[sean: minor changelog tweaks]
Signed-off-by: Sean Christopherson <[email protected]>
[ Quanxian Wang: amend commit log ]
Signed-off-by: Quanxian Wang <[email protected]>
commit 090e3be upstream.

Server product based on the Atom Darkmont core.

Intel-SIG: commit 090e3be x86/cpu: Add model number for Intel Clearwater Forest processor.
BACKPORTING NEW CPU IFM

Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Quanxian Wang: amend commit log ]
Signed-off-by: Quanxian Wang <[email protected]>
commit 664596bd98bb251dd417dfd3f9b615b661e1e44a upstream.

Hide the Intel Birch Stream SoC TCO WDT feature since it was removed.

On platforms with PCH TCO WDT, this redundant device might be rendering
errors like this:

[   28.144542] sysfs: cannot create duplicate filename '/bus/platform/devices/iTCO_wdt'

Intel-SIG: commit 664596bd98bb i2c: i801: Hide Intel Birch Stream SoC TCO WDT

Fixes: 8c56f9e ("i2c: i801: Add support for Intel Birch Stream SoC")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=220320
Signed-off-by: Chiasheng Lee <[email protected]>
Cc: <[email protected]> # v6.7+
Reviewed-by: Mika Westerberg <[email protected]>
Reviewed-by: Jarkko Nikula <[email protected]>
Signed-off-by: Andi Shyti <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

Conflicts:
	drivers/i2c/busses/i2c-i801.c
[jz: resolve context conflicts]
Signed-off-by: Jason Zeng <[email protected]>

commit 76db7aa upstream.

Sync the new sample type for the branch counters feature.

Signed-off-by: Kan Liang <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Alexey Bayduraev <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Tinghao Zhang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
Signed-off-by: Jason Zeng <[email protected]>
commit 72b8b94 upstream.

Sort header files alphabetically.

Intel-SIG: commit 72b8b94 powercap: intel_rapl: Sort header files
Backport TPMI based RAPL PMU support for GNR and future Xeons.

Signed-off-by: Zhang Rui <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
[ Yunying Sun: amend commit log ]
Signed-off-by: Yunying Sun <[email protected]>
Signed-off-by: Jason Zeng <[email protected]>
commit 575024a upstream.

Introduce two new APIs rapl_package_add_pmu()/rapl_package_remove_pmu().

RAPL driver can invoke these APIs to expose its supported energy
counters via perf PMU. The new RAPL PMU is fully compatible with current
MSR RAPL PMU, including using the same PMU name and events
name/id/unit/scale, etc.

For example, use the command below
 perf stat -e power/energy-pkg/ -e power/energy-ram/ FOO
to get the energy consumption, if the power/energy-pkg/ and
power/energy-ram/ events are available in the "perf list" output.

This does not introduce any conflict because TPMI RAPL is the only user
of these APIs currently, and it never co-exists with MSR RAPL.

Note that RAPL Packages can be probed/removed dynamically, and the
events supported by each TPMI RAPL device can be different. Thus the
RAPL PMU support is done on demand, which means
1. PMU is registered only if it is needed by a RAPL Package. PMU events
   for unsupported counters are not exposed.
2. PMU is unregistered and registered when a new RAPL Package is probed
   and supports new counters that are not supported by current PMU.
   For example, on a dual-package system using TPMI RAPL, it is possible
   that Package 1 behaves as TPMI domain root and supports Psys domain.
   In this case, register PMU without Psys event when probing Package 0,
   and re-register the PMU with Psys event when probing Package 1.
3. PMU is unregistered when all registered RAPL Packages don't need PMU.

Intel-SIG: commit 575024a powercap: intel_rapl: Introduce APIs for PMU support
Backport TPMI based RAPL PMU support for GNR and future Xeons.

Signed-off-by: Zhang Rui <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
[ Yunying Sun: amend commit log ]
Signed-off-by: Yunying Sun <[email protected]>
Signed-off-by: Jason Zeng <[email protected]>
commit 963a9ad upstream.

Enable RAPL PMU support for TPMI RAPL driver.

Intel-SIG: commit 963a9ad powercap: intel_rapl_tpmi: Enable PMU support
Backport TPMI based RAPL PMU support for GNR and future Xeons.

Signed-off-by: Zhang Rui <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
[ Yunying Sun: amend commit log ]
Signed-off-by: Yunying Sun <[email protected]>
Signed-off-by: Jason Zeng <[email protected]>
commit 0007f39 upstream.

The unit control address of some CXL units may be wrongly calculated
under some configurations on an EMR machine.

The current implementation only saves the unit control address of the
units from the first die, and the first unit of the rest of dies. Perf
assumed that the units from the other dies have the same offset as the
first die. So the unit control address of the rest of the units can be
calculated. However, the assumption is wrong, especially for the CXL
units.

Introduce an RB tree for each uncore type to save the unit control
address and three kinds of ID information (unit ID, PMU ID, and die ID)
for all units.
The unit ID is a physical ID of a unit.
The PMU ID is a logical ID assigned to a unit. The logical IDs start
from 0 and must be contiguous. The physical ID and the logical ID are
1:1 mapping. The units with the same physical ID in different dies share
the same PMU.
The die ID indicates which die a unit belongs to.

The RB tree can be searched by two different keys (unit ID or PMU ID +
die ID). During the RB tree setup, the unit ID is used as a key to look
up the RB tree. The perf can create/assign a proper PMU ID to the unit.
Later, after the RB tree is setup, PMU ID + die ID is used as a key to
look up the RB tree to fill the cpumask of a PMU. It's used more
frequently, so PMU ID + die ID is compared in the unit_less().
The uncore_find_unit() has to be O(N). But the RB tree setup only occurs
once during the driver load time. It should be acceptable.

Compared with the current implementation, more space is required to save
the information of all units. The extra size should be acceptable.
For example, on EMR, there are 221 units at most. For a 2-socket machine,
the extra space is ~6KB at most.

Intel-SIG: commit 0007f39 perf/x86/uncore: Save the unit control address of all units
Backport SPR/EMR HBM and CXL PMON support to kernel v6.6

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Yunying Sun: amend commit log ]
Signed-off-by: Yunying Sun <[email protected]>
Signed-off-by: Jason Zeng <[email protected]>
commit c74443d upstream.

The cpumask of some uncore units, e.g., CXL uncore units, may be wrong
under some configurations. Perf may access an uncore counter of a
non-existent uncore unit.

The uncore driver assumes that all uncore units are symmetric among
dies. A global cpumask is shared among all uncore PMUs. However, some
CXL uncore units may only be available on some dies.

A per PMU cpumask is introduced to track the CPU mask of this PMU.
The driver searches the unit control RB tree to check whether the PMU is
available on a given die, and updates the per PMU cpumask accordingly.

Intel-SIG: commit c74443d perf/x86/uncore: Support per PMU cpumask
Backport SPR/EMR HBM and CXL PMON support to kernel v6.6

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Yunying Sun <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Yunying Sun: amend commit log ]
Signed-off-by: Yunying Sun <[email protected]>
Signed-off-by: Jason Zeng <[email protected]>
commit 585463f upstream.

The box_ids only save the unit IDs for the first die. If a unit, e.g., a
CXL unit, doesn't exist in the first die, its unit ID cannot be
retrieved.

The unit control RB tree also stores the unit ID information, so
retrieve the unit ID from the unit control RB tree instead.

Intel-SIG: commit 585463f perf/x86/uncore: Retrieve the unit ID from the unit control RB tree
Backport SPR/EMR HBM and CXL PMON support to kernel v6.6

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Yunying Sun <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Yunying Sun: amend commit log ]
Signed-off-by: Yunying Sun <[email protected]>
Signed-off-by: Jason Zeng <[email protected]>
commit 80580da upstream.

The unit control RB tree has the unit control and unit ID information
for all the units. Use it to replace the box_ctls/mmio_offsets to get
an accurate unit control address for MMIO uncore units.

Intel-SIG: commit 80580da perf/x86/uncore: Apply the unit control RB tree to MMIO uncore units
Backport SPR/EMR HBM and CXL PMON support to kernel v6.6

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Yunying Sun <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Yunying Sun: amend commit log ]
Signed-off-by: Yunying Sun <[email protected]>
Signed-off-by: Jason Zeng <[email protected]>
commit b1d9ea2 upstream.

The unit control RB tree has the unit control and unit ID information
for all the MSR units. Use them to replace the box_ctl and
uncore_msr_box_ctl() to get an accurate unit control address for MSR
uncore units.

Add intel_generic_uncore_assign_hw_event(), which utilizes the accurate
unit control address from the unit control RB tree to calculate the
config_base and event_base.

The unit id related information should be retrieved from the unit
control RB tree as well.

Intel-SIG: commit b1d9ea2 perf/x86/uncore: Apply the unit control RB tree to MSR uncore units
Backport SPR/EMR HBM and CXL PMON support to kernel v6.6

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Yunying Sun <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Yunying Sun: amend commit log ]
Signed-off-by: Yunying Sun <[email protected]>
Signed-off-by: Jason Zeng <[email protected]>
commit f76a842 upstream.

The unit control RB tree has the unit control and unit ID information
for all the PCI units. Use them to replace the box_ctls/pci_offsets to
get an accurate unit control address for PCI uncore units.

The UPI/M3UPI units in the discovery table are ignored. Please see the
commit 65248a9 ("perf/x86/uncore: Add a quirk for UPI on SPR").
Manually allocate a unit control RB tree for UPI/M3UPI.
Add cleanup_extra_boxes to release such manual allocation.

Intel-SIG: commit f76a842 perf/x86/uncore: Apply the unit control RB tree to PCI uncore units
Backport SPR/EMR HBM and CXL PMON support to kernel v6.6

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Yunying Sun <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Yunying Sun: amend commit log ]
Signed-off-by: Yunying Sun <[email protected]>
Signed-off-by: Jason Zeng <[email protected]>
commit 15a4bd5 upstream.

The unit control and ID information are retrieved from the unit control
RB tree. No one uses the old structure anymore. Remove them.

Intel-SIG: commit 15a4bd5 perf/x86/uncore: Cleanup unused unit structure
Backport SPR/EMR HBM and CXL PMON support to kernel v6.6

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Yunying Sun <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Yunying Sun: amend commit log ]
Signed-off-by: Yunying Sun <[email protected]>
Signed-off-by: Jason Zeng <[email protected]>
commit f8a86a9 upstream.

Unknown uncore PMON types can be found in both SPR and EMR with HBM or
CXL.

 $ ls /sys/devices/ | grep type
 uncore_type_12_16
 uncore_type_12_18
 uncore_type_12_2
 uncore_type_12_4
 uncore_type_12_6
 uncore_type_12_8
 uncore_type_13_17
 uncore_type_13_19
 uncore_type_13_3
 uncore_type_13_5
 uncore_type_13_7
 uncore_type_13_9

The unknown PMON types are HBM and CXL PMON. Except for the name, the
other information regarding the HBM and CXL PMON counters can be
retrieved via the discovery table. Add them into the uncores tables for
SPR and EMR.

The event config registers for all CXL related units are 8-byte apart.
Add SPR_UNCORE_MMIO_OFFS8_COMMON_FORMAT to specially handle it.

Intel-SIG: commit f8a86a9 perf/x86/intel/uncore: Support HBM and CXL PMON counters
Backport SPR/EMR HBM and CXL PMON support to kernel v6.6

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Yunying Sun <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Yunying Sun: amend commit log ]
Signed-off-by: Yunying Sun <[email protected]>
Signed-off-by: Jason Zeng <[email protected]>
commit d4b5694 upstream.

From PMU's perspective, the SPR/GNR server has a similar uarch to the
ADL/MTL client p-core. Many functions are shared. However, the shared
function name uses the abbreviation of the server product code name,
rather than the common uarch code name.

Rename these internal shared functions by the common uarch name.

Intel-SIG: commit d4b5694 perf/x86/intel: Use the common uarch name for the shared functions
Backport as a dependency needed by the GNR distinct pmu name fix

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Yunying Sun: amend commit log ]
Signed-off-by: Yunying Sun <[email protected]>
Signed-off-by: Jason Zeng <[email protected]>
H. Peter Anvin (Intel) and others added 14 commits October 29, 2025 20:57
commit 208d8c7 upstream.

Let cpu_init_exception_handling() call cpu_init_fred_exceptions() to
initialize FRED. However, if FRED is unavailable or disabled, it falls
back to setting up the TSS IST and initializing the IDT.

Intel-SIG: commit 208d8c7 x86/fred: Invoke FRED initialization
code to enable FRED
Backport FRED support.

Co-developed-by: Xin Li <[email protected]>
Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Signed-off-by: Xin Li <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Tested-by: Shan Kang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
(cherry picked from commit 208d8c7)
[ Ethan Zhao: amend commit log ]
Signed-off-by: Ethan Zhao <[email protected]>

commit cba9ff3 upstream.

Change array_index_mask_nospec() to __always_inline because "inline" is
broken, as described at https://www.kernel.org/doc/local/inline.html.

Intel-SIG: commit cba9ff3 x86/fred: Fix a build warning with
allmodconfig due to 'inline' failing to inline properly
Backport FRED support.

Fixes: 6786137bf8fd ("x86/fred: FRED entry/exit and dispatch code")
Reported-by: Stephen Rothwell <[email protected]>
Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
(cherry picked from commit cba9ff3)
[ Ethan Zhao: amend commit log ]
Signed-off-by: Ethan Zhao <[email protected]>
commit e138419 upstream.

Add H. Peter Anvin and myself as FRED maintainers.

Intel-SIG: commit e138419 MAINTAINERS: Add a maintainer entry
for FRED
Backport FRED support.

Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Acked-by: H. Peter Anvin (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
(cherry picked from commit e138419)
[ Ethan Zhao: amend commit log ]
Signed-off-by: Ethan Zhao <[email protected]>
commit c416b5b upstream.

As TOP_OF_KERNEL_STACK_PADDING was defined as 0 on x86_64, it went
unnoticed that the initialization of the .sp field in INIT_THREAD and some
calculations in the low level startup code do not take the padding into
account.

FRED enabled kernels require a 16 byte padding, which means that the init
task initialization and the low level startup code use the wrong stack
offset.

Subtract TOP_OF_KERNEL_STACK_PADDING in all affected places to adjust for
this.
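
As a back-of-the-envelope sketch of the arithmetic (sizes illustrative; the real THREAD_SIZE depends on the kernel config, and these are not the kernel's actual helpers):

```c
#include <stdint.h>

/* Illustrative values; real THREAD_SIZE varies with the kernel config. */
#define THREAD_SIZE                  16384u
#define TOP_OF_KERNEL_STACK_PADDING  16u   /* 16 on FRED kernels, else 0 */

/* The initial SP must sit below the padding reserved at the stack top,
 * which is what the affected init-task and startup code had missed. */
uint64_t initial_sp(uint64_t stack_base)
{
    return stack_base + THREAD_SIZE - TOP_OF_KERNEL_STACK_PADDING;
}
```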

Intel-SIG: commit c416b5b x86/fred: Fix init_task thread stack pointer
initialization
Backport FRED support.

Fixes: 65c9cc9 ("x86/fred: Reserve space for the FRED stack frame")
Fixes: 3adee77 ("x86/smpboot: Remove initial_stack on 64-bit")
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Closes: https://lore.kernel.org/oe-lkp/[email protected]
Link: https://lore.kernel.org/r/[email protected]
(cherry picked from commit c416b5b)
[ Ethan Zhao: amend commit log ]
Signed-off-by: Ethan Zhao <[email protected]>
commit 989b5cf upstream.

Depending on whether FRED is enabled, sysvec_install() installs a system
interrupt handler into either FRED's system vector dispatch table or the
IDT.

However FRED can be disabled later in trap_init(), after sysvec_install()
has been invoked already; e.g., the HYPERVISOR_CALLBACK_VECTOR handler is
registered with sysvec_install() in kvm_guest_init(), which is called in
setup_arch() but way before trap_init().

IOW, there is a gap between FRED being detected as available and FRED
actually being disabled. As a result, when FRED is available but disabled,
early sysvec_install() invocations fail to install the IDT handler,
resulting in spurious interrupts.

Fix it by parsing the cmdline param "fred=" in cpu_parse_early_param() to
ensure that FRED is disabled before the first sysvec_install() invocation.

Intel-SIG: commit 989b5cf x86/fred: Parse cmdline param "fred=" in
cpu_parse_early_param()
Backport FRED support.

Fixes: 3810da1 ("x86/fred: Add a fred= cmdline param")
Reported-by: Hou Wenlong <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]

(cherry picked from commit 989b5cf)
[ Ethan Zhao: amend commit log ]
Signed-off-by: Ethan Zhao <[email protected]>
commit 73270c1 upstream.

To enable FRED earlier, move the RSP initialization out of
cpu_init_fred_exceptions() into cpu_init_fred_rsps().

This is required as the FRED RSP initialization depends on the availability
of the CPU entry areas, which are set up late in trap_init().

No functional change intended. Marked with Fixes as it's a dependency for
the real fix.

Intel-SIG: commit 73270c1 x86/fred: Move FRED RSP initialization into
separate function
Backport FRED support.

Fixes: 14619d9 ("x86/fred: FRED entry/exit and dispatch code")
Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
(cherry picked from commit 73270c1)
[ Ethan Zhao: amend commit log ]
Signed-off-by: Ethan Zhao <[email protected]>
commit a97756c upstream.

On 64-bit, init_mem_mapping() relies on the minimal page fault handler
provided by the early IDT mechanism. The real page fault handler is
installed right afterwards into the IDT.

This is problematic on CPUs which have X86_FEATURE_FRED set because the
real page fault handler retrieves the faulting address from the FRED
exception stack frame and not from CR2, but that does obviously not work
when FRED is not yet enabled in the CPU.

To prevent this enable FRED right after init_mem_mapping() without
interrupt stacks. Those are enabled later in trap_init() after the CPU
entry area is set up.

[ tglx: Encapsulate the FRED details ]

Intel-SIG: commit a97756c x86/fred: Enable FRED right after
init_mem_mapping()
Backport FRED support.

Fixes: 14619d9 ("x86/fred: FRED entry/exit and dispatch code")
Reported-by: Hou Wenlong <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
(cherry picked from commit a97756c)
[ Ethan Zhao: amend commit log ]
Signed-off-by: Ethan Zhao <[email protected]>
commit 723edbd2ca5fb4c78ac4a5644511c63895fd1c57 upstream.

SS is initialized to NULL during boot time and not explicitly set to
__KERNEL_DS.

With FRED enabled, if a kernel event is delivered before a CPU goes to
user level for the first time, its SS is NULL thus NULL is pushed into
the SS field of the FRED stack frame.  But before ERETS is executed,
the CPU may context switch to another task and go to user level.  Then
when the CPU comes back to kernel mode, SS is changed to __KERNEL_DS.
Later when ERETS is executed to return from the kernel event handler,
a #GP fault is generated because SS doesn't match the SS saved in the
FRED stack frame.

Initialize SS to __KERNEL_DS when enabling FRED to prevent that.

Note, IRET doesn't check if SS matches the SS saved in its stack frame,
thus IDT doesn't have this problem.  For IDT it doesn't matter whether
SS is set to __KERNEL_DS or not, because it's set to NULL upon interrupt
or exception delivery and __KERNEL_DS upon SYSCALL.  Thus it's pointless
to initialize SS for IDT.

Intel-SIG: commit 723edbd x86/fred: Set SS to __KERNEL_DS when
enabling FRED
Backport FRED support.

Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]

(cherry picked from commit 723edbd)
[ Ethan Zhao: amend commit log ]
Signed-off-by: Ethan Zhao <[email protected]>
commit 0dfac6f upstream.

In most cases, ti_work values passed to arch_exit_to_user_mode_prepare()
are zeros, e.g., 99% in kernel build tests.  So an obvious optimization is
to test ti_work for zero before processing individual bits in it.

Omit the optimization when FPU debugging is enabled, otherwise the
FPU consistency check is never executed.

Intel 0day tests did not find a performance regression with this change.
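
As a rough illustration (names and bits are placeholders, not the kernel's actual TIF_* flags or entry code), the optimization amounts to one zero test guarding the per-bit checks:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for TIF_* work bits; not the kernel's values. */
#define WORK_RESCHED (1u << 0)
#define WORK_SIGNAL  (1u << 1)

static unsigned int slow_path_entries;

/* Toy version of the exit-to-user work handling: the common
 * ti_work == 0 case returns before any individual bit is inspected. */
bool handle_exit_work(uint32_t ti_work)
{
    if (!ti_work)              /* ~99% of exits in kernel-build tests */
        return false;

    slow_path_entries++;       /* count how often the slow path runs */
    if (ti_work & WORK_RESCHED)
        ;                      /* would reschedule here */
    if (ti_work & WORK_SIGNAL)
        ;                      /* would deliver signals here */
    return true;
}
```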

Intel-SIG: commit 0dfac6f x86/entry: Test ti_work for zero before
processing individual bits
Backport FRED support.

Suggested-by: H. Peter Anvin (Intel) <[email protected]>
Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]

(cherry picked from commit 0dfac6f)
[ Ethan Zhao: amend commit log ]
Signed-off-by: Ethan Zhao <[email protected]>
…nism

commit efe5088 upstream.

Per the discussion about FRED MSR writes with WRMSRNS instruction [1],
use the alternatives mechanism to choose WRMSRNS when it's available,
otherwise fallback to WRMSR.

Remove the dependency on X86_FEATURE_WRMSRNS as WRMSRNS is no longer
dependent on FRED.

[1] https://lore.kernel.org/lkml/[email protected]/

Use DS prefix to pad WRMSR instead of a NOP. The prefix is ignored. At
least that's the current information from the hardware folks.

Intel-SIG: commit efe5088 x86/msr: Switch between WRMSRNS and WRMSR
with the alternatives mechanism
Backport FRED support.

Signed-off-by: Andrew Cooper <[email protected]>
Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]

(cherry picked from commit efe5088)
[ Ethan Zhao: amend commit log ]
Signed-off-by: Ethan Zhao <[email protected]>
…itch

commit fe85ee3 upstream.

The FRED RSP0 MSR points to the top of the kernel stack for user level
event delivery. As this is the task stack it needs to be updated when a
task is scheduled in.

The update is done at context switch. That means it's also done when
switching to kernel threads, which is pointless as those never go out to
user space. For KVM threads this means there are two writes to FRED_RSP0 as
KVM has to switch to the guest value before VMENTER.

Defer the update to the exit to user space path and cache the per CPU
FRED_RSP0 value, so redundant writes can be avoided.

Provide fred_sync_rsp0() for KVM to keep the cache in sync with the actual
MSR value after returning from guest to host mode.
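
A minimal sketch of the caching scheme (illustrative names, a single global instead of a per-CPU variable, and a counter standing in for the actual WRMSR):

```c
#include <stdint.h>

/* Toy model of the deferred-update scheme described above. */
static uint64_t fred_rsp0_cache;
static unsigned int msr_writes;       /* counts simulated WRMSRs */

static void wrmsr_fred_rsp0(uint64_t val)
{
    msr_writes++;                     /* real code would write the MSR */
    (void)val;
}

/* Called on the exit-to-user path: write the MSR only when the task's
 * stack top differs from the last value written. */
void fred_update_rsp0(uint64_t task_stack_top)
{
    if (fred_rsp0_cache != task_stack_top) {
        wrmsr_fred_rsp0(task_stack_top);
        fred_rsp0_cache = task_stack_top;
    }
}

/* fred_sync_rsp0()-style helper: refresh the cache after something else
 * (e.g. the guest) wrote the MSR behind our back. */
void fred_sync_rsp0(uint64_t current_msr_val)
{
    fred_rsp0_cache = current_msr_val;
}
```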

[ tglx: Massage change log ]

Intel-SIG: commit fe85ee3 x86/entry: Set FRED RSP0 on return to
userspace instead of context switch
Backport FRED support.

Suggested-by: Sean Christopherson <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
(cherry picked from commit fe85ee3)
[ Ethan Zhao: amend commit log ]
Signed-off-by: Ethan Zhao <[email protected]>
commit de31b3cd706347044e1a57d68c3a683d58e8cca4 upstream.

The FRED RSP0 MSR is only used for delivering events when running
userspace.  Linux leverages this property to reduce expensive MSR
writes and optimize context switches.  The kernel only writes the
MSR when about to run userspace *and* when the MSR has actually
changed since the last time userspace ran.

This optimization is implemented by maintaining a per-CPU cache of
FRED RSP0 and then checking that against the value for the top of
current task stack before running userspace.

However cpu_init_fred_exceptions() writes the MSR without updating
the per-CPU cache.  This means that the kernel might return to
userspace with MSR_IA32_FRED_RSP0==0 when it needed to point to the
top of current task stack.  This would induce a double fault (#DF),
which is bad.

A context switch after cpu_init_fred_exceptions() can paper over
the issue since it updates the cached value.  That evidently
happens most of the time explaining how this bug got through.

Fix the bug by resynchronizing the FRED RSP0 MSR with its per-CPU
cache in cpu_init_fred_exceptions().

Intel-SIG: commit de31b3cd7063 x86/fred: Fix the FRED RSP0 MSR out
of sync with its per-CPU cache
Backport FRED support.

Fixes: fe85ee3 ("x86/entry: Set FRED RSP0 on return to userspace instead of context switch")
Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Cc:[email protected]
Link: https://lore.kernel.org/all/20250110174639.1250829-1-xin%40zytor.com
(cherry picked from commit de31b3cd706347044e1a57d68c3a683d58e8cca4)
[ Ethan Zhao: amend commit log ]
Signed-off-by: Ethan Zhao <[email protected]>
commit e5f1e8af9c9e151ecd665f6d2e36fb25fec3b110 upstream.

Upon a wakeup from S4, the restore kernel starts and initializes the
FRED MSRs as needed from its perspective.  It then loads a hibernation
image, including the image kernel, and attempts to load image pages
directly into their original page frames used before hibernation unless
those frames are currently in use.  Once all pages are moved to their
original locations, it jumps to a "trampoline" page in the image kernel.

At this point, the image kernel takes control, but the FRED MSRs still
contain values set by the restore kernel, which may differ from those
set by the image kernel before hibernation.  Therefore, the image kernel
must ensure the FRED MSRs have the same values as before hibernation.
Since these values depend only on the location of the kernel text and
data, they can be recomputed from scratch.

Intel-SIG: commit e5f1e8af9c9e1 x86/fred: Fix system hang during S4
 resume with FRED enabled
Backport FRED support.

Reported-by: Xi Pardee <[email protected]>
Reported-by: Todd Brandt <[email protected]>
Tested-by: Todd Brandt <[email protected]>
Suggested-by: H. Peter Anvin (Intel) <[email protected]>
Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Rafael J. Wysocki <[email protected]>
Reviewed-by: H. Peter Anvin (Intel) <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
(cherry picked from commit e5f1e8af9c9e151ecd665f6d2e36fb25fec3b110)
[ Ethan Zhao: amend commit log ]
Signed-off-by: Ethan Zhao <[email protected]>
…rn from SIGTRAP handler

commit e34dbbc85d64af59176fe59fad7b4122f4330fe2 upstream.

Clear the software event flag in the augmented SS to prevent immediate
repeat of single step trap on return from SIGTRAP handler if the trap
flag (TF) is set without an external debugger attached.

Following is a typical single-stepping flow for a user process:

1) The user process is prepared for single-stepping by setting
   RFLAGS.TF = 1.
2) When any instruction in user space completes, a #DB is triggered.
3) The kernel handles the #DB and returns to user space, invoking the
   SIGTRAP handler with RFLAGS.TF = 0.
4) After the SIGTRAP handler finishes, the user process performs a
   sigreturn syscall, restoring the original state, including
   RFLAGS.TF = 1.
5) Goto step 2.

According to the FRED specification:

A) Bit 17 in the augmented SS is designated as the software event
   flag, which is set to 1 for FRED event delivery of SYSCALL,
   SYSENTER, or INT n.
B) If bit 17 of the augmented SS is 1 and ERETU would result in
   RFLAGS.TF = 1, a single-step trap will be pending upon completion
   of ERETU.

In step 4) above, the software event flag is set upon the sigreturn
syscall, and its corresponding ERETU would restore RFLAGS.TF = 1.
This combination causes a pending single-step trap upon completion of
ERETU.  Therefore, another #DB is triggered before any user space
instruction is executed, which leads to an infinite loop in which the
SIGTRAP handler keeps being invoked on the same user space IP.
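
The two spec rules and the fix can be modeled as a toy (not kernel code; the macro and function names are made up for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of FRED spec rules A) and B) above. */
#define SS_SW_EVENT (1u << 17)   /* bit 17 of the augmented SS */

/* Rule B: a single-step trap is pending after ERETU iff the software
 * event flag is set and the restored RFLAGS.TF is 1. */
bool eretu_pends_db(uint32_t augmented_ss, bool rflags_tf)
{
    return (augmented_ss & SS_SW_EVENT) && rflags_tf;
}

/* The fix: clear the software event flag in the saved frame on
 * sigreturn, so restoring TF=1 no longer re-arms #DB before any user
 * instruction executes. */
uint32_t sigreturn_clear_sw_event(uint32_t augmented_ss)
{
    return augmented_ss & ~SS_SW_EVENT;
}
```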

Intel-SIG: commit e34dbbc85d64a x86/fred/signal: Prevent immediate repeat
 of single step trap on return from SIGTRAP handler
Backport FRED support.

Fixes: 14619d9 ("x86/fred: FRED entry/exit and dispatch code")
Suggested-by: H. Peter Anvin (Intel) <[email protected]>
Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Tested-by: Sohil Mehta <[email protected]>
Cc:[email protected]
Link: https://lore.kernel.org/all/20250609084054.2083189-2-xin%40zytor.com
(cherry picked from commit e34dbbc85d64af59176fe59fad7b4122f4330fe2)
[ Ethan Zhao: amend commit log ]
Signed-off-by: Ethan Zhao <[email protected]>
@bhe4 bhe4 force-pushed the 6.6-velinux_all branch 2 times, most recently from 10a6355 to b9b0c02 on October 29, 2025 13:00
@bhe4 bhe4 marked this pull request as draft October 30, 2025 02:42
The call to idxd_free() introduces a duplicate put_device() leading to a
reference count underflow:
refcount_t: underflow; use-after-free.
WARNING: CPU: 15 PID: 4428 at lib/refcount.c:28 refcount_warn_saturate+0xbe/0x110
...
Call Trace:
 <TASK>
  idxd_remove+0xe4/0x120 [idxd]
  pci_device_remove+0x3f/0xb0
  device_release_driver_internal+0x197/0x200
  driver_detach+0x48/0x90
  bus_remove_driver+0x74/0xf0
  pci_unregister_driver+0x2e/0xb0
  idxd_exit_module+0x34/0x7a0 [idxd]
  __do_sys_delete_module.constprop.0+0x183/0x280
  do_syscall_64+0x54/0xd70
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

The idxd_unregister_devices() which is invoked at the very beginning of
idxd_remove(), already takes care of the necessary put_device() through the
following call path:
idxd_unregister_devices() -> device_unregister() -> put_device()

In addition, when CONFIG_DEBUG_KOBJECT_RELEASE is enabled, put_device() may
trigger asynchronous cleanup via schedule_delayed_work(). If idxd_free() is
called immediately after, it can result in a use-after-free.

Remove the improper idxd_free() to avoid both the refcount underflow and
potential memory corruption during module unload.
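
A toy refcount model of the bug (not the kernel's refcount_t API; names are illustrative): device_unregister() already drops the reference, so the extra idxd_free()-style put underflows the count.

```c
#include <stdbool.h>

static int refs = 1;               /* reference held since device_add() */
static bool warned_underflow;

/* Toy put: a put after the count already hit zero is the underflow the
 * kernel WARNs about (refcount_t saturates instead of going negative). */
void toy_put_device(void)
{
    if (refs == 0) {
        warned_underflow = true;
        return;
    }
    refs--;
}
```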

Fixes: d5449ff1b04d ("dmaengine: idxd: Add missing idxd cleanup to fix memory leak in remove call")
Signed-off-by: Yi Sun <[email protected]>
Tested-by: Shuai Xue <[email protected]>
Reviewed-by: Dave Jiang <[email protected]>
Acked-by: Vinicius Costa Gomes <[email protected]>

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Vinod Koul <[email protected]>
ysun and others added 2 commits November 2, 2025 12:46
A recent refactor introduced a misplaced put_device() call, resulting in a
reference count underflow during module unload.

There is no need to add additional put_device() calls for idxd groups,
engines, or workqueues. Although the commit claims: "Note, this also
fixes the missing put_device() for idxd groups, engines, and wqs.",
it appears no such omission actually existed. The required cleanup is
already handled by the call chain:
idxd_unregister_devices() -> device_unregister() -> put_device()

Extend idxd_cleanup() to handle the remaining necessary cleanup and
remove idxd_cleanup_internals(), which duplicates deallocation logic
for idxd, engines, groups, and workqueues. Memory management is also
properly handled through the Linux device model.

Fixes: a409e919ca32 ("dmaengine: idxd: Refactor remove call with idxd_cleanup() helper")
Signed-off-by: Yi Sun <[email protected]>
Tested-by: Shuai Xue <[email protected]>
Reviewed-by: Dave Jiang <[email protected]>
Acked-by: Vinicius Costa Gomes <[email protected]>

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Vinod Koul <[email protected]>
The clean up in idxd_setup_wqs() has had a couple bugs because the error
handling is a bit subtle.  It's simpler to just re-write it in a cleaner
way.  The issues here are:

1) If "idxd->max_wqs" is <= 0 then we call put_device(conf_dev) when
   "conf_dev" hasn't been initialized.
2) If kzalloc_node() fails then again "conf_dev" is invalid.  It's
   either uninitialized or it points to the "conf_dev" from the
   previous iteration so it leads to a double free.

It's better to free partial loop iterations within the loop and then
the unwinding at the end can handle whole loop iterations.  I also
renamed the labels to describe what the goto does and not where the goto
was located.
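
The unwinding idiom described above can be sketched like this (hypothetical types and names, not the idxd driver's code):

```c
#include <stdlib.h>

struct toy_wq { int id; };

/* Partial work in the failing iteration is freed inside the loop, so
 * the error label only ever unwinds fully set-up iterations. */
int toy_setup_wqs(struct toy_wq **wqs, int max_wqs)
{
    int i;

    for (i = 0; i < max_wqs; i++) {
        wqs[i] = malloc(sizeof(**wqs));
        if (!wqs[i])
            goto err_unwind;     /* nothing partial to free for slot i */
        wqs[i]->id = i;
        /* a later failure in this iteration would free wqs[i] here,
         * then goto err_unwind */
    }
    return 0;

err_unwind:                       /* label says what it does, not where */
    while (--i >= 0)
        free(wqs[i]);
    return -1;
}
```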

Fixes: 3fd2f4bc010c ("dmaengine: idxd: fix memory leak in error handling path of idxd_setup_wqs")
Reported-by: Colin Ian King <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Dan Carpenter <[email protected]>
Reviewed-by: Dave Jiang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Vinod Koul <[email protected]>
Clearwater Forest has the same C-state residency counters as Sierra Forest,
so this simply adds the CPU model ID for it.

Cc: Artem Bityutskiy <[email protected]>
Cc: Kan Liang <[email protected]>
Reviewed-by: Kan Liang <[email protected]>
Signed-off-by: Zhenyu Wang <[email protected]>
With the introduction of Perfmon v6, PMU counters can be discontiguous;
e.g., among the fixed counters on CWF only counters 0-3 and 5-7 are
supported, and there is no fixed counter 4. To accommodate this change,
the archPerfmonExt CPUID (0x23) leaves are introduced to enumerate the
true view of the counter bitmaps.

Current perf code already supports the archPerfmonExt CPUID and uses the
counter bitmaps to enumerate the counters the HW really supports, but
x86_pmu_show_pmu_cap() still dumps only the absolute counter number
instead of the true-view bitmap; that is outdated and may mislead readers.

So dump the counters' true-view bitmap in x86_pmu_show_pmu_cap() and
opportunistically change the dump sequence and wording.
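
A toy illustration of why a bare count misleads (the bitmap value mirrors the CWF example above; the helper is made up):

```c
#include <stdint.h>

/* 0b11101111: fixed counters 0-3 and 5-7 exist, counter 4 does not.
 * A popcount of 7 alone would not reveal the hole at bit 4. */
#define FIXED_CTR_BITMAP 0xEFu

int ctr_popcount(uint32_t bm)
{
    int n = 0;
    for (; bm; bm >>= 1)
        n += bm & 1;
    return n;
}
```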

Signed-off-by: Dapeng Mi <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Kan Liang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
beckwen and others added 3 commits November 5, 2025 11:20
…esctrl subsystem

commit 594902c986e269660302f09df9ec4bf1cf017b77 upstream.

In the resctrl subsystem's Sub-NUMA Cluster (SNC) mode, the rdt_mon_domain
structure representing a NUMA node relies on the cacheinfo interface
(rdt_mon_domain::ci) to store L3 cache information (e.g., shared_cpu_map)
for monitoring. The L3 cache information of a SNC NUMA node determines
which domains are summed for the "top level" L3-scoped events.

rdt_mon_domain::ci is initialized using the first online CPU of a NUMA
node. When this CPU goes offline, its shared_cpu_map is cleared to contain
only the offline CPU itself. Subsequently, attempting to read counters
via smp_call_on_cpu(offline_cpu) fails (and error ignored), returning
zero values for "top-level events" without any error indication.

Replace the cacheinfo references in struct rdt_mon_domain and struct
rmid_read with the cacheinfo ID (a unique identifier for the L3 cache).

rdt_domain_hdr::cpu_mask contains the online CPUs associated with that
domain. When reading "top-level events", select a CPU from
rdt_domain_hdr::cpu_mask and utilize its L3 shared_cpu_map to determine
valid CPUs for reading RMID counter via the MSR interface.

Considering all CPUs associated with the L3 cache improves the chances
of picking a housekeeping CPU on which the counter reading work can be
queued, avoiding an unnecessary IPI.
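
The CPU selection can be modeled as a toy with 64-bit masks standing in for cpumasks (illustrative only, not the kernel's cpumask API):

```c
#include <stdint.h>

static int first_cpu(uint64_t mask)
{
    for (int cpu = 0; cpu < 64; cpu++)
        if (mask & (1ull << cpu))
            return cpu;
    return -1;   /* empty mask */
}

/* Prefer a housekeeping CPU sharing the domain's L3; any online CPU in
 * the L3 shared mask is still valid for the RMID MSR read. */
int pick_reader_cpu(uint64_t l3_shared, uint64_t online, uint64_t housekeeping)
{
    uint64_t valid = l3_shared & online;
    uint64_t preferred = valid & housekeeping;

    return preferred ? first_cpu(preferred) : first_cpu(valid);
}
```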

Fixes: 328ea68 ("x86/resctrl: Prepare for new Sub-NUMA Cluster (SNC) monitor files")
Signed-off-by: Qinyun Tan <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Tony Luck <[email protected]>
Link: https://lore.kernel.org/[email protected]
Signed-off-by: Kui Wen <[email protected]>
1. add CONFIG_INTEL_IFS=m
2. add CONFIG_DMATEST=m
3. add CONFIG_TCG_TPM=y
4. do the below change for EDAC on 25ww43.4
CONFIG_EDAC=y
CONFIG_EDAC_DEBUG=y
CONFIG_EDAC_DECODE_MCE=y
CONFIG_ACPI_APEI_ERST_DEBUG=y
CONFIG_EDAC_IEH=m
5. do the below change for the power module
CONFIG_INTEL_TPMI=m
CONFIG_INTEL_VSEC=m
CONFIG_INTEL_RAPL_TPMI=m
CONFIG_INTEL_PMT_CLASS=m
CONFIG_INTEL_PMT_TELEMETRY=m
6. enable FRED
CONFIG_X86_FRED=y
7. enable CET
CONFIG_X86_USER_SHADOW_STACK=y

Signed-off-by: Bo He <[email protected]>