Conversation

@bhe4 (Contributor) commented Oct 21, 2025

This PR combines all of the PRs below into a single PR for testing purposes. PR #57 had build issues, so it is skipped:

#82 opened yesterday by quanxianwang
#81 opened 5 days ago by quanxianwang
#80 opened 5 days ago by zhang-rui
#79 opened last week by zhang-rui
#78 opened 2 weeks ago by zhang-rui
#77 opened 2 weeks ago by zhang-rui
#76 opened 2 weeks ago by zhang-rui
#75 opened 2 weeks ago by zhang-rui
#74 opened 2 weeks ago by zhang-rui
#73 opened 2 weeks ago by zhang-rui
#72 opened 2 weeks ago by quanxianwang
#71 opened 3 weeks ago by quanxianwang
#61 opened on Sep 17 by x56Jason
#57 opened on Jul 5 by jackYoung0915
#50 opened on Jun 11 by PvsNarasimha
#47 opened on May 9 by PvsNarasimha
#44 opened on Mar 31 by zhiquan1-li
#40 opened on Mar 27 by EthanZHF
#33 opened on Mar 18 by EthanZHF

David Jeffery and others added 30 commits October 21, 2025 17:39
…w_queues

commit 8f5fea6 upstream.

When blk_mq_delay_run_hw_queues sets an hctx to run in the future, it can
reset the delay length for an already pending delayed work run_work. This
creates a scenario where multiple hctx may have their queues set to run,
but if one runs first and finds nothing to do, it can reset the delay of
another hctx and stall the other hctx's ability to run requests.

To avoid this I/O stall, when an hctx's run_work is already pending,
leave it untouched so it runs at its current designated time rather than
extending its delay. The work will still run, which keeps closed the
race that blk_mq_delay_run_hw_queues is needed for, while also avoiding
the I/O stall.

Signed-off-by: David Jeffery <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/20220131203337.GA17666@redhat
Signed-off-by: Jens Axboe <[email protected]>
Signed-off-by: Diangang Li <[email protected]>
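
A minimal sketch of the fix described above, assuming blk-mq's existing
queue_for_each_hw_ctx iterator (this approximates, not reproduces, the
upstream diff):

  void blk_mq_delay_run_hw_queues(struct request_queue *q, unsigned long msecs)
  {
          struct blk_mq_hw_ctx *hctx;
          int i;

          queue_for_each_hw_ctx(q, hctx, i) {
                  if (blk_mq_hctx_stopped(hctx))
                          continue;
                  /*
                   * Leave an already-pending run_work at its current
                   * deadline; rearming it here is what could push out
                   * another hctx's run and stall its requests.
                   */
                  if (delayed_work_pending(&hctx->run_work))
                          continue;
                  blk_mq_delay_run_hw_queue(hctx, msecs);
          }
  }
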
commit f91f2a9 upstream.

A new DSA device ID, 0x11fb, is introduced for the Granite Rapids-D
platform. Add the device ID to the IDXD driver.

Since a potential security issue has been fixed on the new device, it is
safe to assign the device to virtual machines, and therefore the new
device ID will not be added to the VFIO denylist. Additionally, the new
device ID may be useful in identifying and addressing any other potential
issues with this specific device in the future. The same applies to any
other new DSA/IAA devices with new device IDs.

Intel-SIG: commit f91f2a9 dmaengine: idxd: Add a new DSA device ID
for Granite Rapids-D platform
Add GNR idxd support.

Signed-off-by: Fenghua Yu <[email protected]>
Reviewed-by: Dave Jiang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Vinod Koul <[email protected]>
(cherry picked from commit f91f2a9)
Signed-off-by: Ethan Zhao <[email protected]>
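
In driver terms the change is a one-line ID-table addition; a sketch of
the kind of entry involved (macro and table names follow the idxd
driver's existing pattern and are illustrative here):

  #define PCI_DEVICE_ID_INTEL_DSA_GNRD	0x11fb

  static struct pci_device_id idxd_pci_tbl[] = {
          /* DSA on Granite Rapids-D */
          { PCI_DEVICE_DATA(INTEL, DSA_GNRD, &idxd_driver_data[IDXD_TYPE_DSA]) },
          { 0, }
  };
  MODULE_DEVICE_TABLE(pci, idxd_pci_tbl);
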
commit 5f4d1fd upstream.

OpenSSL 3.0 deprecates some of the functions used in the SGX
selftests, causing build errors on new distros. For now, ignore
the warnings until support for the functions is no longer
available, and add a FIXME so it is clear this should be removed
at some point.

Intel-SIG: commit 5f4d1fd selftests/sgx: Ignore OpenSSL 3.0
deprecated functions warning
Backport some SGX bug fixes.

Signed-off-by: Kristen Carlson Accardi <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
Signed-off-by: Shuah Khan <[email protected]>
[ Zhiquan Li: amend commit log ]
Signed-off-by: Zhiquan Li <[email protected]>
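
The suppression amounts to a compiler pragma near the top of the
affected selftest source, roughly (comment wording is illustrative):

  /*
   * FIXME: OpenSSL 3.0 deprecates the low-level RSA helpers used below;
   * silence the warning until the code is ported to the EVP APIs.
   */
  #pragma GCC diagnostic ignored "-Wdeprecated-declarations"
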
commit ee56a28 upstream.

Modify the comments for sgx_encl_lookup_backing() and for
sgx_encl_alloc_backing() to indicate that they take a reference
which must be dropped with a call to sgx_encl_put_backing().
Make sgx_encl_lookup_backing() static for now, and change the
name of sgx_encl_get_backing() to __sgx_encl_get_backing() to
make it more clear that sgx_encl_get_backing() is an internal
function.

Intel-SIG: commit ee56a28 x86/sgx: Improve comments for
sgx_encl_lookup/alloc_backing()
Backport some SGX bug fixes.

Signed-off-by: Kristen Carlson Accardi <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lore.kernel.org/all/[email protected]/
[ Zhiquan Li: amend commit log ]
Signed-off-by: Zhiquan Li <[email protected]>
commit 370839c upstream.

Short Version:

Allow enclaves to use the new Asynchronous EXit (AEX)
notification mechanism.  This mechanism lets enclaves run a
handler after an AEX event.  These handlers can run mitigations
for things like SGX-Step[1].

AEX Notify will be made available both on upcoming processors and
on some older processors through microcode updates.

Long Version:

== SGX Attribute Background ==

The SGX architecture includes a list of SGX "attributes".  These
attributes ensure consistency and transparency around specific
enclave features.

As a simple example, the "DEBUG" attribute allows an enclave to
be debugged, but also destroys virtually all of SGX security.
Using attributes, enclaves can know that they are being debugged.
Attributes also affect enclave attestation so an enclave can, for
instance, be denied access to secrets while it is being debugged.

The kernel keeps a list of known attributes and will only
initialize enclaves that use a known set of attributes.  This
kernel policy eliminates the chance that a new SGX attribute
could cause undesired effects.

For example, imagine a new attribute was added called
"PROVISIONKEY2" that provided similar functionality to
"PROVISIIONKEY".  A kernel policy that allowed indiscriminate use
of unknown attributes and thus PROVISIONKEY2 would undermine the
existing kernel policy which limits use of PROVISIONKEY enclaves.

== AEX Notify Background ==

"Intel Architecture Instruction Set Extensions and Future
Features - Version 45" is out[2].  There is a new chapter:

	Asynchronous Enclave Exit Notify and the EDECCSSA User Leaf Function.

Enclave exits can be either synchronous and consensual (EEXIT for
instance) or asynchronous (on an interrupt or fault).  The
asynchronous ones can evidently be exploited to single step
enclaves[1], on top of which other naughty things can be built.

AEX Notify will be made available both on upcoming processors and
on some older processors through microcode updates.

== The Problem ==

These attacks are currently entirely opaque to the enclave since
the hardware does the save/restore under the covers. The
Asynchronous Enclave Exit Notify (AEX Notify) mechanism provides
enclaves an ability to detect and mitigate potential exposure to
these kinds of attacks.

== The Solution ==

Define the new attribute value for AEX Notification.  Ensure the
attribute is cleared from the list of reserved attributes.  Instead
of adding to the open-coded lists of individual attributes,
add named lists of privileged (disallowed by default) and
unprivileged (allowed by default) attributes.  Add the AEX notify
attribute as an unprivileged attribute, which will keep the kernel
from rejecting enclaves with it set.

1. https://github.com/jovanbulck/sgx-step
2. https://cdrdv2.intel.com/v1/dl/getContent/671368?explicitVersion=true

Intel-SIG: commit 370839c x86/sgx: Allow enclaves to use
Asynchronous Exit Notification
Backport some SGX bug fixes.

Signed-off-by: Dave Hansen <[email protected]>
Acked-by: Jarkko Sakkinen <[email protected]>
Tested-by: Haitao Huang <[email protected]>
Tested-by: Kai Huang <[email protected]>
Link: https://lore.kernel.org/all/20220720191347.1343986-1-dave.hansen%40linux.intel.com
[ Zhiquan Li: amend commit log ]
Signed-off-by: Zhiquan Li <[email protected]>
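
A sketch of the named attribute lists the commit describes (the AEX
Notify bit assignment follows the SDM; the exact macro names are an
approximation of the upstream header):

  #define SGX_ATTR_ASYNC_EXIT_NOTIFY	BIT_ULL(10)

  /* Privileged: disallowed by default, userspace must opt in. */
  #define SGX_ATTR_PRIV_MASK	SGX_ATTR_PROVISIONKEY

  /* Unprivileged: allowed by default, now including AEX Notify. */
  #define SGX_ATTR_UNPRIV_MASK	(SGX_ATTR_DEBUG | SGX_ATTR_MODE64BIT | \
  				 SGX_ATTR_KSS | SGX_ATTR_ASYNC_EXIT_NOTIFY)
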
commit 16a7fe3 upstream.

The new Asynchronous Exit (AEX) notification mechanism (AEX-notify)
allows one enclave to receive a notification in the ERESUME after the
enclave exit due to an AEX.  EDECCSSA is a new SGX user leaf function
(ENCLU[EDECCSSA]) to facilitate the AEX notification handling.  The new
EDECCSSA is enumerated via CPUID(EAX=0x12,ECX=0x0):EAX[11].

Besides allowing reporting of the new AEX-notify attribute to KVM guests,
also allow reporting of the new EDECCSSA user leaf function to KVM guests
so the guest can fully utilize the AEX-notify mechanism.

Similar to existing X86_FEATURE_SGX1 and X86_FEATURE_SGX2, introduce a
new scattered X86_FEATURE_SGX_EDECCSSA bit for the new EDECCSSA, and
report it in KVM's supported CPUIDs.

Note, no additional KVM enabling is required to allow the guest to use
EDECCSSA.  It's impossible to trap ENCLU (without completely preventing
the guest from using SGX).  Advertise EDECCSSA as supported purely so
that userspace doesn't need to special case EDECCSSA, i.e. doesn't need
to manually check host CPUID.

The inability to trap ENCLU also means that KVM can't prevent the guest
from using EDECCSSA, but that virtualization hole is benign as far as
KVM is concerned.  EDECCSSA is simply a fancy way to modify internal
enclave state.

More background on how AEX-notify and EDECCSSA work:

SGX maintains a Current State Save Area Frame (CSSA) for each enclave
thread.  When AEX happens, the enclave thread context is saved to the
CSSA and the CSSA is increased by 1.  For a normal ERESUME which doesn't
deliver AEX notification, it restores the saved thread context from the
previously saved SSA and decreases the CSSA.  If AEX-notify is enabled
for one enclave, the ERESUME acts differently.  Instead of restoring the
saved thread context and decreasing the CSSA, it acts like EENTER which
doesn't decrease the CSSA but establishes a clean slate thread context
using the CSSA for the enclave to handle the notification.  After some
handling, the enclave must discard the newly-established SSA and switch
back to the previously saved SSA (upon AEX).  Otherwise, the enclave
will run out of SSA space upon further AEXs and eventually fail to run.

To solve this problem, the new EDECCSSA essentially decreases the CSSA.
It can be used by the enclave notification handler to switch back to the
previous saved SSA when needed, i.e. after it handles the notification.

Intel-SIG: commit 16a7fe3 KVM/VMX: Allow exposing EDECCSSA user
leaf function to KVM guest
Backport some SGX bug fixes.

Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Acked-by: Sean Christopherson <[email protected]>
Acked-by: Jarkko Sakkinen <[email protected]>
Link: https://lore.kernel.org/all/20221101022422.858944-1-kai.huang%40intel.com
[ Zhiquan Li: amend commit log ]
Signed-off-by: Zhiquan Li <[email protected]>
commit 49ff3b4 upstream.

On AMD and Hygon platforms, the local APIC does not automatically set
the mask bit of the LVTPC register when handling a PMI and there is
no need to clear it in the kernel's PMI handler.

For guests, the mask bit is currently set by kvm_apic_local_deliver()
and unless it is cleared by the guest kernel's PMI handler, PMIs stop
arriving and break use-cases like sampling with perf record.

This does not affect non-PerfMonV2 guests because PMIs are handled in
the guest kernel by x86_pmu_handle_irq() which always clears the LVTPC
mask bit irrespective of the vendor.

Before:

  $ perf record -e cycles:u true
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.001 MB perf.data (1 samples) ]

After:

  $ perf record -e cycles:u true
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.002 MB perf.data (19 samples) ]

Fixes: a16eb25 ("KVM: x86: Mask LVTPC when handling a PMI")
Cc: [email protected]
Signed-off-by: Sandipan Das <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
[sean: use is_intel_compatible instead of !is_amd_or_hygon()]
Signed-off-by: Sean Christopherson <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Arukonda Rahul <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
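
A sketch of the resulting guard in the local APIC delivery path (the
helper comes from the companion patch below; the surrounding code is
approximate):

  /*
   * In kvm_apic_local_deliver(), after the PMI is injected: only
   * Intel-compatible vCPUs get the LVTPC mask bit set, since AMD and
   * Hygon parts do not mask LVTPC on PMI delivery.
   */
  if (r && lvt_type == APIC_LVTPC &&
      guest_cpuid_is_intel_compatible(apic->vcpu))
          kvm_lapic_set_reg(apic, APIC_LVTPC, reg | APIC_LVT_MASKED);
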
commit fd706c9 upstream.

Add kvm_vcpu_arch.is_amd_compatible to cache if a vCPU's vendor model is
compatible with AMD, i.e. if the vCPU vendor is AMD or Hygon, along with
helpers to check if a vCPU is compatible AMD vs. Intel.  To handle Intel
vs. AMD behavior related to masking the LVTPC entry, KVM will need to
check for vendor compatibility on every PMI injection, i.e. querying for
AMD will soon be a moderately hot path.

Note!  This subtly (or maybe not-so-subtly) makes "Intel compatible" KVM's
default behavior, both if userspace omits (or never sets) CPUID 0x0 and if
userspace sets a completely unknown vendor.  One could argue that KVM
should treat such vCPUs as not being compatible with Intel *or* AMD, but
that would add useless complexity to KVM.

KVM needs to do *something* in the face of vendor specific behavior, and
so unless KVM conjured up a magic third option, choosing to treat unknown
vendors as neither Intel nor AMD means that checks on AMD compatibility
would yield Intel behavior, and checks for Intel compatibility would yield
AMD behavior.  And that's far worse as it would effectively yield random
behavior depending on whether KVM checked for AMD vs. Intel vs. !AMD vs.
!Intel.  And practically speaking, all x86 CPUs follow either Intel or AMD
architecture, i.e. "supporting" an unknown third architecture adds no
value.

Deliberately don't convert any of the existing guest_cpuid_is_intel()
checks, as the Intel side of things is messier due to some flows explicitly
checking for exactly vendor==Intel, versus some flows assuming anything
that isn't "AMD compatible" gets Intel behavior.  The Intel code will be
cleaned up in the future.

Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Arukonda Rahul <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
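
A sketch of the cached helpers this commit describes:

  static inline bool guest_cpuid_is_amd_compatible(struct kvm_vcpu *vcpu)
  {
          /*
           * Cached once in kvm_vcpu_after_set_cpuid(), so PMI-time
           * checks need not re-parse the guest's CPUID 0x0 vendor
           * string.
           */
          return vcpu->arch.is_amd_compatible;
  }

  static inline bool guest_cpuid_is_intel_compatible(struct kvm_vcpu *vcpu)
  {
          return !guest_cpuid_is_amd_compatible(vcpu);
  }
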
commit 25b9784 upstream.

Manually look for a CPUID.0x1 entry instead of bouncing through
kvm_cpuid() when retrieving the Family-Model-Stepping information for
vCPU RESET/INIT.  This fixes a potential undefined behavior bug due to
kvm_cpuid() using the uninitialized "dummy" param as the ECX _input_,
a.k.a. the index.

A more minimal fix would be to simply zero "dummy", but the extra work in
kvm_cpuid() is wasteful, and KVM should be treating the FMS retrieval as
an out-of-band access, e.g. same as how KVM computes guest.MAXPHYADDR.
Both Intel's SDM and AMD's APM describe the RDX value at RESET/INIT as
holding the CPU's FMS information, not as holding CPUID.0x1.EAX.  KVM's
usage of CPUID entries to get FMS is simply a pragmatic approach to avoid
having yet another way for userspace to provide inconsistent data.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Arukonda Rahul <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
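
A sketch of the direct lookup at RESET/INIT (the 0x600 fallback is a
P6-class default and is illustrative here):

  struct kvm_cpuid_entry2 *cpuid_0x1;

  /*
   * The index (ECX) is explicit, unlike the old kvm_cpuid() path that
   * consumed an uninitialized "dummy" as the input index.
   */
  cpuid_0x1 = kvm_find_cpuid_entry(vcpu, 1, 0);
  kvm_rdx_write(vcpu, cpuid_0x1 ? cpuid_0x1->eax : 0x600);
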
commit 540c7ab upstream.

SDM section 18.2.3 mentions that:

  "IA32_PERF_GLOBAL_OVF_CTL MSR allows software to clear overflow indicator(s) of
   any general-purpose or fixed-function counters via a single WRMSR."

The SDM documents it as R/W, but reading this MSR on bare metal during perf
testing always returns 0 on the ICX/SKX boxes at hand. Fill get_msr
MSR_CORE_PERF_GLOBAL_OVF_CTRL with 0 to match hardware behavior and drop the
global_ovf_ctrl variable.

Tested-by: Like Xu <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
…able

commit 73cd107 upstream.

Use the generic kvm_running_vcpu plus a new 'handling_intr_from_guest'
variable in kvm_arch_vcpu instead of the semi-redundant current_vcpu.
kvm_before/after_interrupt() must be called while the vCPU is loaded,
(which protects against preemption), thus kvm_running_vcpu is guaranteed
to be non-NULL when handling_intr_from_guest is non-zero.

Switching to kvm_get_running_vcpu() will allow moving KVM's perf
callbacks to generic code, and the new flag will be used in a future
patch to more precisely identify the "NMI from guest" case.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: PvsNarasimha <[email protected]>
commit e1bfc24 upstream.

Move x86's perf guest callbacks into common KVM, as they are semantically
identical to arm64's callbacks (the only other such KVM callbacks).
arm64 will convert to the common versions in a future patch.

Implement the necessary arm64 arch hooks now to avoid having to provide
stubs or a temporary #define (from x86) to avoid arm64 compilation errors
when CONFIG_GUEST_PERF_EVENTS=y.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: PvsNarasimha <[email protected]>
commit b1d66da upstream.

For Intel, the guest PMU can be disabled via clearing the PMU CPUID.
For AMD, all hw implementations support the base set of four
performance counters, with current mainstream hardware indicating
the presence of two additional counters via X86_FEATURE_PERFCTR_CORE.

In the virtualized world, the AMD guest driver may detect
the presence of at least one counter MSR. Most hypervisor
vendors would introduce a module param (like lbrv for svm)
to disable PMU for all guests.

Another per-VM control proposal is to pass PMU disable information
via MSR_IA32_PERF_CAPABILITIES or one bit in CPUID Fn4000_00[FF:00].
Both of these methods require guest-side changes, so a module
parameter, while not perfectly granular, is practical enough.

Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 006a0f0 upstream.

Because IceLake has 4 fixed performance counters but KVM only
supports 3, it is possible for reprogram_fixed_counters to pass
to reprogram_fixed_counter an index that is out of bounds for the
fixed_pmc_events array.

Ultimately intel_find_fixed_event, which is the only place that uses
fixed_pmc_events, handles this correctly because it checks against the
size of fixed_pmc_events anyway.  Every other place operates on the
fixed_counters[] array which is sized according to INTEL_PMC_MAX_FIXED.
However, it is cleaner if the unsupported performance counters are culled
early on in reprogram_fixed_counters.

Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 7618756 upstream.

The current pmc->eventsel for fixed counters is underutilised. The
pmc->eventsel can be set up for all known available fixed counters,
since we have a mapping between the fixed pmc index and
the intel_arch_events array.

For either GP or fixed counters, this will simplify the later checks
for consistency between eventsel and perf_hw_id.

Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 6ed1298 upstream.

Since we set the same semantic event value for the fixed counter in
pmc->eventsel, returning the perf_hw_id for the fixed counter via
find_fixed_event() can be painlessly replaced by pmc_perf_hw_id()
with the help of pmc_is_fixed() check.

Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 40ccb96 upstream.

Depending on whether intr should be triggered or not, KVM registers
two different event overflow callbacks in the perf_event context.

The code skeleton of these two functions is very similar, so
pmc->intr can be stored into the pmc from pmc_reprogram_counter(),
which gives a smaller instruction footprint for the
microarchitecture's branch predictor.

__kvm_perf_overflow() can be called in non-NMI contexts,
and a flag is needed to distinguish the caller context and thus
avoid a check on kvm_is_in_guest(); otherwise we might get
warnings from suspicious RCU or check_preemption_disabled().

[Backport Changes]
- In commit b9f5621, kvm_is_in_guest() was changed
  to kvm_guest_state().
- In commit 73cd107, kvm_guest_state() was updated
  to kvm_handling_nmi_from_guest().
- In commit 40ccb96, kvm_is_in_guest() was removed;
  kvm_handling_nmi_from_guest(pmc->vcpu) was retained in its place
  for compatibility.
- Accordingly, this backported patch uses
  kvm_handling_nmi_from_guest(pmc->vcpu) instead of kvm_is_in_guest().

Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 9cd803d upstream.

When KVM retires a guest instruction through emulation, increment any
vPMCs that are configured to monitor "instructions retired," and
update the sample period of those counters so that they will overflow
at the right time.

Signed-off-by: Eric Hankland <[email protected]>
[jmattson:
  - Split the code to increment "branch instructions retired" into a
    separate commit.
  - Added 'static' to kvm_pmu_incr_counter() definition.
  - Modified kvm_pmu_incr_counter() to check pmc->perf_event->state ==
    PERF_EVENT_STATE_ACTIVE.
]
Fixes: f5132b0 ("KVM: Expose a version 2 architectural PMU to a guests")
Signed-off-by: Jim Mattson <[email protected]>
[likexu:
  - Drop checks for pmc->perf_event or event state or event type
  - Increase a counter once its umask bits and the first 8 select bits are matched
  - Rewrite kvm_pmu_incr_counter() with a less invasive approach to the host perf;
  - Rename kvm_pmu_record_event to kvm_pmu_trigger_event;
  - Add counter enable and CPL check for kvm_pmu_trigger_event();
]
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
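
A condensed sketch of the resulting trigger path (helper names
approximate the upstream patch):

  void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 perf_hw_id)
  {
          struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
          struct kvm_pmc *pmc;
          int i;

          for_each_set_bit(i, pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX) {
                  pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, i);
                  if (!pmc || !pmc_is_enabled(pmc) ||
                      !pmc_speculative_in_use(pmc))
                          continue;

                  /*
                   * Match the umask plus the first 8 select bits and
                   * the CPL filter; edge detect, pin control, invert
                   * and CMASK bits are deliberately ignored.
                   */
                  if (eventsel_match_perf_hw_id(pmc, perf_hw_id) &&
                      cpl_is_matched(pmc))
                          kvm_pmu_incr_counter(pmc);
          }
  }
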
commit 405329f upstream.

Normally guests will set up CR3 themselves, but some guests, such as
kselftests, and potentially CONFIG_PVH guests, rely on being booted
with paging enabled and CR3 initialized to a pre-allocated page table.

Currently CR3 updates via KVM_SET_SREGS* are not loaded into the guest
VMCB until just prior to entering the guest. For SEV-ES/SEV-SNP, this
is too late, since it will have switched over to using the VMSA page
prior to that point, with the VMSA CR3 copied from the VMCB initial
CR3 value: 0.

Address this by syncing the CR3 value into the VMCB save area
immediately when KVM_SET_SREGS* is issued, so it will find its way into
the initial VMSA.

Suggested-by: Tom Lendacky <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
Message-Id: <[email protected]>
[Remove vmx_post_set_cr3; add a remark about kvm_set_cr3 not calling the
 new hook. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
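
A hedged sketch of the idea as an SVM-side hook (names and the exact
dirty-tracking call are approximations, not the literal upstream diff):

  static void svm_post_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
  {
          struct vcpu_svm *svm = to_svm(vcpu);

          /*
           * For SEV-ES/SEV-SNP the VMCB save area seeds the VMSA, so
           * CR3 from KVM_SET_SREGS* must land there immediately, not
           * just before the next guest entry.
           */
          if (sev_es_guest(vcpu->kvm)) {
                  svm->vmcb->save.cr3 = cr3;
                  vmcb_mark_dirty(svm->vmcb, VMCB_CR);
          }
  }
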
…tries

commit ee3a5f9 upstream.

kvm_update_cpuid_runtime() mangles CPUID data coming from userspace
VMM after updating 'vcpu->arch.cpuid_entries', this makes it
impossible to compare an update with what was previously
supplied. Introduce __kvm_update_cpuid_runtime() version which can be
used to tweak the input before it goes to 'vcpu->arch.cpuid_entries'
so the upcoming update check can compare tweaked data.

No functional change intended.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: chaithanyaLagisetty <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 4732f24 upstream.

The new module parameter to control PMU virtualization should apply
to Intel as well as AMD, for situations where userspace is not trusted.
If the module parameter allows PMU virtualization, there could be a
new KVM_CAP or guest CPUID bits whereby userspace can enable/disable
PMU virtualization on a per-VM basis.

If the module parameter does not allow PMU virtualization, there
should be no userspace override, since we have no precedent for
authorizing that kind of override. If it's false, other counter-based
profiling features (such as LBR including the associated CPUID bits
if any) will not be exposed.

Change its name from "pmu" to "enable_pmu", as we have temporary
variables with the same name in our code, like "struct kvm_pmu *pmu".

Fixes: b1d66da ("KVM: x86/svm: Add module param to control PMU virtualization")
Suggested-by: Jim Mattson <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
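
The knob itself is tiny; a sketch of the renamed, vendor-neutral
parameter (its placement in common x86 code is per the commit text):

  /* Read-only after module load (0444); PMU virtualization on by default. */
  bool __read_mostly enable_pmu = true;
  module_param(enable_pmu, bool, 0444);
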
commit 7ff775a upstream.

The PMU event filter may contain up to 300 events. Replace the linear
search in reprogram_gp_counter() with a binary search.

Signed-off-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
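
A sketch of the sorted-array approach, using the kernel's sort() and
bsearch() helpers (the masking constant is illustrative):

  static int cmp_u64(const void *pa, const void *pb)
  {
          u64 a = *(u64 *)pa;
          u64 b = *(u64 *)pb;

          return (a > b) - (a < b);
  }

  /* Sort once when userspace installs the filter... */
  sort(filter->events, filter->nevents, sizeof(u64), cmp_u64, NULL);

  /* ...then each reprogram is an O(log n) lookup instead of a scan. */
  key = eventsel & AMD64_RAW_EVENT_MASK_NB;
  found = bsearch(&key, filter->events, filter->nevents,
                  sizeof(u64), cmp_u64);
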
commit c3e8abf upstream.

Drop kvm_x86_ops' pre/post_block() now that all implementations are nops.

No functional change intended.

[Backport Changes]
Definitions of pi_{pre, post}_block() were removed in the commit: d76fb40

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
…runtime()

commit 5c89be1 upstream.

Full equality check of CPUID data on update (kvm_cpuid_check_equal()) may
fail for SGX enabled CPUs as CPUID.(EAX=0x12,ECX=1) is currently being
mangled in kvm_vcpu_after_set_cpuid(). Move it to
__kvm_update_cpuid_runtime() and split off a cpuid_get_supported_xcr0()
helper, as the 'vcpu->arch.guest_supported_xcr0' update (logically) needs
to stay in kvm_vcpu_after_set_cpuid().

Cc: [email protected]
Fixes: feb627e ("KVM: x86: Forbid KVM_SET_CPUID{,2} after KVM_RUN")
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Arukonda Rahul <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 2746a6b upstream.

Hypervisor leaves are always synthesized by __do_cpuid_func; just return
zeroes and do not ask the host.  Even on nested virtualization, a value
from another hypervisor would be bogus, because all hypercalls and MSRs
are processed by KVM.

Reviewed-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Arukonda Rahul <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit feee3d9 upstream.

Remove the export of kvm_x86_tlb_flush_current() as there are no longer
any users outside of common x86 code.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit e27bc04 upstream.

Rename a variety of kvm_x86_op function pointers so that preferred name
for vendor implementations follows the pattern <vendor>_<function>, e.g.
rename .run() to .vcpu_run() to match {svm,vmx}_vcpu_run().  This will
allow vendor implementations to be wired up via the KVM_X86_OP macro.

In many cases, VMX and SVM "disagree" on the preferred name, though in
reality it's VMX and x86 that disagree as SVM blindly prepended _svm to
the kvm_x86_ops name.  Justification for using the VMX nomenclature:

  - set_{irq,nmi} => inject_{irq,nmi} because the helper is injecting an
    event that has already been "set" in e.g. the vIRR.  SVM's relevant
    VMCB field is even named event_inj, and KVM's stat is irq_injections.

  - prepare_guest_switch => prepare_switch_to_guest because the former is
    ambiguous, e.g. it could mean switching between multiple guests,
    switching from the guest to host, etc...

  - update_pi_irte => pi_update_irte to match the rest of VMX's posted
    interrupt naming scheme, which is vmx_pi_<blah>().

  - start_assignment => pi_start_assignment to again follow VMX's posted
    interrupt naming scheme, and to provide context for what bit of code
    might care about an otherwise undescribed "assignment".

The "tlb_flush" => "flush_tlb" creates an inconsistency with respect to
Hyper-V's "tlb_remote_flush" hooks, but Hyper-V really is the one that's
wrong.  x86, VMX, and SVM all use flush_tlb, and even common KVM is on a
variant of the bandwagon with "kvm_flush_remote_tlbs", e.g. a more
appropriate name for the Hyper-V hooks would be flush_remote_tlbs.  Leave
that change for another time as the Hyper-V hooks always start as NULL,
i.e. the name doesn't matter for using kvm-x86-ops.h, and changing all
names requires an astounding amount of churn.

VMX and SVM function names are intentionally left as is to minimize the
diff.  Both VMX and SVM will need to rename even more functions in order
to fully utilize KVM_X86_OPS, i.e. an additional patch for each is
inevitable.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 0bcd556 upstream.

Refactor the nested VMX PMU refresh helper to pass it a flag stating
whether or not the vCPU has PERF_GLOBAL_CTRL instead of having the nVMX
helper query the information by bouncing through kvm_x86_ops.pmu_ops.
This will allow a future patch to use static_call() for the PMU ops
without having to export any static call definitions from common x86, and
it is also a step toward unexported kvm_x86_ops.

Alternatively, nVMX could call kvm_pmu_is_valid_msr() to indirectly use
kvm_x86_ops.pmu_ops, but that would incur an extra layer of indirection
and would require exporting kvm_pmu_is_valid_msr().

Opportunistically rename the helper to keep line lengths somewhat
reasonable, and to better capture its high-level role.

No functional change intended.

Cc: Like Xu <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 03d004c upstream.

Use slightly more verbose names for the so called "memory encrypt",
a.k.a. "mem enc", kvm_x86_ops hooks to bridge the gap between the current
super short kvm_x86_ops names and SVM's more verbose, but non-conforming
names.  This is a step toward using kvm-x86-ops.h with KVM_X86_CVM_OP()
to fill svm_x86_ops.

Opportunistically rename mem_enc_op() to mem_enc_ioctl() to better
reflect its true nature, as it really is a full fledged ioctl() of its
own.  Ideally, the hook would be named confidential_vm_ioctl() or so, as
the ioctl() is a gateway to more than just memory encryption, and because
its underlying purpose is to support Confidential VMs, which can be provided
without memory encryption, e.g. if the TCB of the guest includes the host
kernel but not host userspace, or by isolation in hardware without
encrypting memory.  But, diverging from KVM_MEMORY_ENCRYPT_OP even
further is undesirable, and short of creating aliases for all related
ioctl()s, which introduces a different flavor of divergence, KVM is stuck
with the nomenclature.

Defer renaming SVM's functions to a future commit as there are additional
changes needed to make SVM fully conforming and to match reality (looking
at you, svm_vm_copy_asid_from()).

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 8a28978 upstream.

The two ioctls used to implement userspace-accelerated TPR,
KVM_TPR_ACCESS_REPORTING and KVM_SET_VAPIC_ADDR, are available
even if hardware-accelerated TPR can be used.  So there is
no reason not to report KVM_CAP_VAPIC.

[Backport changes]
- In commit 58fccda, report_flexpriority() is renamed
to vmx_cpu_has_accelerated_tpr().

Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
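
In kvm_vm_ioctl_check_extension() terms the change reduces to reporting
the capability unconditionally, roughly:

  case KVM_CAP_VAPIC:
          /*
           * The two TPR ioctls work whether or not hardware TPR
           * acceleration is in use, so always advertise support.
           */
          r = 1;
          break;
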
@bhe4 force-pushed the 5.15-velinux_all_without_57 branch 3 times, most recently from a8f4fcf to 521296b on October 28, 2025 02:09
Kan Liang and others added 21 commits October 28, 2025 19:24
commit d4b5694 upstream.

From PMU's perspective, the SPR/GNR server has a similar uarch to the
ADL/MTL client p-core. Many functions are shared. However, the shared
function name uses the abbreviation of the server product code name,
rather than the common uarch code name.

Rename these internal shared functions by the common uarch name.

Intel-SIG: commit d4b5694 perf/x86/intel: Use the common uarch name for the shared functions.
CWF PMU core backporting

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Quanxian Wang: amend commit log ]
Signed-off-by: Quanxian Wang <[email protected]>
commit 0ba0c03 upstream.

The SPR and ADL p-core have a similar uarch. Most of the initialization
code can be shared.

Factor out intel_pmu_init_glc() for the common initialization code.
The common part of the ADL p-core will be replaced by the later patch.

Intel-SIG: commit 0ba0c03 perf/x86/intel: Factor out the initialization code for SPR.
CWF PMU core backporting

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Quanxian Wang: amend commit log ]
Signed-off-by: Quanxian Wang <[email protected]>
commit d87d221 upstream.

From PMU's perspective, the ADL e-core and newer SRF/GRR have a similar
uarch. Most of the initialization code can be shared.

Factor out intel_pmu_init_grt() for the common initialization code.
The common part of the ADL e-core will be replaced by the later patch.

Intel-SIG: commit d87d221 perf/x86/intel: Factor out the initialization code for ADL e-core.
CWF PMU core backporting

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Quanxian Wang: amend commit log ]
Signed-off-by: Quanxian Wang <[email protected]>
commit 299a5fc upstream.

Use the intel_pmu_init_glc() and intel_pmu_init_grt() to replace the
duplicate code for ADL.

The current code already checks the PERF_X86_EVENT_TOPDOWN flag before
invoking the Topdown metrics functions. (The PERF_X86_EVENT_TOPDOWN flag
is to indicate the Topdown metric feature, which is only available for
the p-core.) Drop the unnecessary adl_set_topdown_event_period() and
adl_update_topdown_event().

Intel-SIG: commit 299a5fc perf/x86/intel: Apply the common initialization code for ADL.
CWF PMU core backporting

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Quanxian Wang: amend commit log ]
Signed-off-by: Quanxian Wang <[email protected]>
commit b0560bf upstream.

There is a fairly long list of grievances about the current code. The
main beefs:

   1. hybrid_big_small assumes that the *HARDWARE* (CPUID) provided
      core types are a bitmap. They are not. If Intel happened to
      make a core type of 0xff, hilarity would ensue.
   2. adl_get_hybrid_cpu_type() is utterly inscrutable.  There are
      precisely zero comments and zero changelog about what it is
      attempting to do.

According to Kan, the adl_get_hybrid_cpu_type() is there because some
Alder Lake (ADL) CPUs can do some silly things. Some ADL models are
*supposed* to be hybrid CPUs with big and little cores, but there are
some SKUs that only have big cores. CPUID(0x1a) on those CPUs does
not say that the CPUs are big cores. It apparently just returns 0x0.
It confuses perf because it expects to see either 0x40 (Core) or
0x20 (Atom).

The perf workaround for this is to watch for a CPU core saying it is
type 0x0. If that happens on an Alder Lake, it calls
x86_pmu.get_hybrid_cpu_type() and just assumes that the core is a
Core (0x40) CPU.

To fix up the mess, separate out the CPU types and the 'pmu' types.
This allows 'hybrid_pmu_type' bitmaps without worrying that some
future CPU type will set multiple bits.

Since the types are now separate, add a function to glue them back
together again. Actual comment on the situation in the glue
function (find_hybrid_pmu_for_cpu()).

Also, give ->get_hybrid_cpu_type() a real return type and make it
clear that it is overriding the *CPU* type, not the PMU type.

Rename cpu_type to pmu_type in the struct x86_hybrid_pmu to reflect the
change.

Originally-by: Dave Hansen <[email protected]>
Intel-SIG: commit b0560bf perf/x86/intel: Clean up the hybrid CPU type handling code.
CWF PMU core backporting

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Quanxian Wang: amend commit log ]
Signed-off-by: Quanxian Wang <[email protected]>
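
A condensed sketch of the glue function described above (enum names
follow the commit's hybrid_cpu_type/hybrid_pmu_type split):

  static struct x86_hybrid_pmu *find_hybrid_pmu_for_cpu(void)
  {
          u8 cpu_type = get_this_hybrid_cpu_type();
          int i;

          /*
           * Some ADL SKUs report CPU type 0; let the model-specific
           * fixup override the *CPU* type, not the PMU type.
           */
          if (cpu_type == HYBRID_INTEL_NONE && x86_pmu.get_hybrid_cpu_type)
                  cpu_type = x86_pmu.get_hybrid_cpu_type();

          /* Glue the two enums back together. */
          for (i = 0; i < x86_pmu.num_hybrid_pmus; i++) {
                  struct x86_hybrid_pmu *pmu = &x86_pmu.hybrid_pmu[i];

                  if (cpu_type == HYBRID_INTEL_CORE &&
                      pmu->pmu_type == hybrid_big)
                          return pmu;
                  if (cpu_type == HYBRID_INTEL_ATOM &&
                      pmu->pmu_type == hybrid_small)
                          return pmu;
          }
          return NULL;
  }
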
commit 97588df upstream.

The current hybrid initialization code isn't well organized and is
hard to read.

Factor out intel_pmu_init_hybrid() to do a common setup for each
hybrid PMU. The PMU-specific capability will be updated later via either
hard code (ADL) or CPUID hybrid enumeration (MTL).

Split the ADL and MTL initialization code, since they have
different uarches. The hard-coded PMU capabilities are not required for
MTL either. They can be enumerated by the new leaf 0x23 and the
IA32_PERF_CAPABILITIES MSR.

The hybrid enumeration of the IA32_PERF_CAPABILITIES MSR is broken on
MTL. Use the default value.

Intel-SIG: commit 97588df perf/x86/intel: Add common intel_pmu_init_hybrid().
CWF PMU core backporting

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Quanxian Wang: amend commit log ]
Signed-off-by: Quanxian Wang <[email protected]>
commit 950ecdc upstream.

Unnecessary multiplexing is triggered when running an "instructions"
event on an MTL.

perf stat -e cpu_core/instructions/,cpu_core/instructions/ -a sleep 1

 Performance counter stats for 'system wide':

       115,489,000      cpu_core/instructions/                (50.02%)
       127,433,777      cpu_core/instructions/                (49.98%)

       1.002294504 seconds time elapsed

Linux architectural perf events, e.g., cycles and instructions, usually
have dedicated fixed counters. These events also have equivalent events
which can be used in the general-purpose counters. The counters are
precious. In intel_pmu_check_event_constraints(), perf checks/extends
the event constraints of these events so that they can utilize both
fixed counters and general-purpose counters.

The following cleanup commit:

  97588df ("perf/x86/intel: Add common intel_pmu_init_hybrid()")

forgot to add the intel_pmu_check_event_constraints() call into
update_pmu_cap(). As a result, the architectural perf events cannot
utilize the general-purpose counters.

The code to check and update the counters, event constraints and
extra_regs is the same among hybrid systems. Move
intel_pmu_check_hybrid_pmus() to init_hybrid_pmu(), and
remove the duplicate check in update_pmu_cap().

Fixes: 97588df ("perf/x86/intel: Add common intel_pmu_init_hybrid()")
Intel-SIG: commit 950ecdc perf/x86/intel: Fix broken fixed event constraints extension.
CWF PMU core backporting

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Quanxian Wang: amend commit log ]
Signed-off-by: Quanxian Wang <[email protected]>
commit a23eb2f upstream.

The current perf code assumes that the counters that support PEBS are
contiguous. But that is not guaranteed with the newly introduced leaf
0x23. The counters are enumerated with a counter mask. There may be
holes in the counter mask on future platforms or in a virtualization
environment.

Store the PEBS event mask rather than the maximum number of PEBS
counters in the x86 PMU structures.

Intel-SIG: commit a23eb2f perf/x86/intel: Support the PEBS event mask.
CWF PMU core backporting

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Ian Rogers <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
[ Quanxian Wang: amend commit log ]
Signed-off-by: Quanxian Wang <[email protected]>
commit 722e42e upstream.

The current perf code assumes that both GP and fixed counters are
contiguous. But that is not guaranteed on newer Intel platforms or in a
virtualization environment.

Use a counter mask to replace the number of counters for both GP and
fixed counters. For the other archs or older platforms which don't
support a counter mask, use GENMASK_ULL(num_counter - 1, 0) as the
replacement. There is no functional change for them.

The interface to KVM is not changed. The number of counters is still
passed to KVM. It can be updated later separately.

Intel-SIG: commit 722e42e perf/x86: Support counter mask.
CWF PMU core backporting

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Ian Rogers <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
[ Quanxian Wang: amend commit log ]
Signed-off-by: Quanxian Wang <[email protected]>
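
A minimal sketch of the compatibility path described above (field names
approximate the patch):

  /*
   * Platforms without counter-mask enumeration derive the masks from
   * the legacy counter counts; newer ones read them from leaf 0x23.
   */
  x86_pmu.cntr_mask64       = GENMASK_ULL(x86_pmu.num_counters - 1, 0);
  x86_pmu.fixed_cntr_mask64 = GENMASK_ULL(x86_pmu.num_counters_fixed - 1, 0);

  /* Iteration then walks set bits instead of assuming 0..n-1: */
  for_each_set_bit(idx, (unsigned long *)&x86_pmu.cntr_mask64,
                   INTEL_PMC_MAX_GENERIC) {
          /* ...program or validate counter idx... */
  }
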
commit a932aa0 upstream.

From PMU's perspective, Lunar Lake and Arrow Lake are similar to the
previous generation Meteor Lake. Both are hybrid platforms, with e-core
and p-core.

The key differences include:
- The e-core supports 3 new fixed counters
- The p-core supports an updated PEBS Data Source format
- More GP counters (Updated event constraint table)
- New Architectural performance monitoring V6
  (New Perfmon MSRs aliasing, umask2, eq).
- New PEBS format V6 (Counters Snapshotting group)
- New RDPMC metrics clear mode

The legacy features, the 3 new fixed counters and updated event
constraint table are enabled in this patch.

The new PEBS data source format, the architectural performance
monitoring V6, the PEBS format V6, and the new RDPMC metrics clear mode
are supported in the following patches.

Intel-SIG: commit a932aa0 perf/x86: Add Lunar Lake and Arrow Lake support.
CWF PMU core backporting

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Ian Rogers <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
[ Quanxian Wang: amend commit log ]
Signed-off-by: Quanxian Wang <[email protected]>
commit 0902624 upstream.

The model-specific pebs_latency_data functions of ADL and MTL use
"small" as a suffix to indicate the e-core. That suffix is too generic
for a model-specific function: it provides no information that maps the
function directly to a specific uarch, which would facilitate
development and maintenance.
Use the abbreviation of the uarch to rename the model-specific functions.

Intel-SIG: commit 0902624 perf/x86/intel: Rename model-specific pebs_latency_data functions.
CWF PMU core backporting

Suggested-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Ian Rogers <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
[ Quanxian Wang: amend commit log ]
Signed-off-by: Quanxian Wang <[email protected]>
commit 608f697 upstream.

A new PEBS data source format is introduced for the p-core of Lunar
Lake. The data source field is extended to 8 bits with new encodings.

A new layout is introduced into the union intel_x86_pebs_dse.
Introduce the lnl_latency_data() to parse the new format.
Enlarge the pebs_data_source[] accordingly to include new encodings.

Only the mem load and the mem store events can generate the data source.
Introduce INTEL_HYBRID_LDLAT_CONSTRAINT and
INTEL_HYBRID_STLAT_CONSTRAINT to mark them.

Add two new bits for the new cache-related data src, L2_MHB and MSC.
The L2_MHB is short for L2 Miss Handling Buffer, which is similar to
LFB (Line Fill Buffer), but to track the L2 Cache misses.
The MSC stands for the memory-side cache.

Intel-SIG: commit 608f697 perf/x86/intel: Support new data source for Lunar Lake.
CWF PMU core backporting

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Ian Rogers <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
[ Quanxian Wang: amend commit log ]
Signed-off-by: Quanxian Wang <[email protected]>
commit e8fb5d6 upstream.

Different vendors may support different fields in the EVENTSEL MSR; for
example, Intel introduces the new umask2 and eq fields in the EVENTSEL
MSR starting with Perfmon version 6. However, a fixed mask,
X86_RAW_EVENT_MASK, is currently used to filter attr.config.

Introduce a new config_mask to record the real supported EVENTSEL
bitmask.
Only apply it to the existing code for now. No functional change.

Co-developed-by: Dapeng Mi <[email protected]>
Intel-SIG: commit e8fb5d6 perf/x86: Add config_mask to represent EVENTSEL bitmask.
CWF PMU core backporting

Signed-off-by: Dapeng Mi <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Ian Rogers <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
[ Quanxian Wang: amend commit log ]
Signed-off-by: Quanxian Wang <[email protected]>
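
The filtering change itself is a two-liner; a sketch (field placement is
per the commit text):

  /* Legacy init keeps the old fixed mask as the default... */
  x86_pmu.config_mask = X86_RAW_EVENT_MASK;

  /*
   * ...and event setup filters attr.config against the PMU's real
   * mask instead of hard-coding X86_RAW_EVENT_MASK:
   */
  event->hw.config |= event->attr.config & x86_pmu.config_mask;
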
commit dce0c74 upstream.

Two new fields (the unit mask2, and the equal flag) are added in the
IA32_PERFEVTSELx MSRs. They can be enumerated by the CPUID.23H.0.EBX.

Update the config_mask in x86_pmu and x86_hybrid_pmu for the true layout
of the PERFEVTSEL.
Expose the new formats into sysfs if they are available. The umask
extension reuses the same format attr name "umask" as the previous
umask. Add umask2_show to determine/display the correct format
for the current machine.

Co-developed-by: Dapeng Mi <[email protected]>
Intel-SIG: commit dce0c74 perf/x86/intel: Support PERFEVTSEL extension.
CWF PMU core backporting

Signed-off-by: Dapeng Mi <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Ian Rogers <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
[ Quanxian Wang: amend commit log ]
Signed-off-by: Quanxian Wang <[email protected]>
commit 149fd47 upstream.

The architectural performance monitoring V6 supports a new range of
counters' MSRs in the 19xxH address range. They include all the GP
counter MSRs, the GP control MSRs, and the fixed counter MSRs.

The step between each sibling counter is 4. Add intel_pmu_addr_offset()
to calculate the correct offset.

Add fixedctr in struct x86_pmu to store the address of the fixed counter
0. It can be used to calculate the rest of the fixed counters.

The MSR address of the fixed counter control is not changed.

Intel-SIG: commit 149fd47 perf/x86/intel: Support Perfmon MSRs aliasing.
CWF PMU core backporting

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Ian Rogers <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
[ Quanxian Wang: amend commit log ]
Signed-off-by: Quanxian Wang <[email protected]>
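
A sketch of the arithmetic (the step constant's name is illustrative;
the stride of 4 between sibling counter MSRs is the point):

  #define MSR_IA32_PMC_V6_STEP	4

  /*
   * MSR offset of counter `index` relative to counter 0 in the new
   * 19xxH alias range.
   */
  static int intel_pmu_addr_offset(int index, bool eventsel)
  {
          return MSR_IA32_PMC_V6_STEP * index;
  }
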
commit 48d66c89dce1e3687174608a5f5c31d5961a9916 upstream.

From the PMU's perspective, Clearwater Forest is similar to the previous
generation Sierra Forest.

The key differences are the ARCH PEBS feature and the 3 newly added fixed
counters for topdown L1 metrics events.

ARCH PEBS support arrives in the following patches. This patch provides
support for the basic perfmon features and the 3 newly added fixed counters.

Intel-SIG: commit 48d66c89dce1 perf/x86/intel: Add PMU support for Clearwater Forest.
CWF PMU core backporting

Signed-off-by: Dapeng Mi <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
[ Quanxian Wang: amend commit log ]
Signed-off-by: Quanxian Wang <[email protected]>
commit 25c623f41438fafc6f63c45e2e141d7bcff78299 upstream.

CPUID archPerfmonExt (0x23) leaves are supported to enumerate CPU-level
PMU capabilities on non-hybrid processors as well.

This patch adds support for parsing archPerfmonExt leaves on non-hybrid
processors. Architectural PEBS leverages archPerfmonExt sub-leaves 0x4
and 0x5 to enumerate the PEBS capabilities as well. This patch is a
precursor of the subsequent arch-PEBS enabling patches.

Intel-SIG: commit 25c623f41438 perf/x86/intel: Parse CPUID archPerfmonExt leaves for non-hybrid CPUs.
CWF PMU core backporting

Signed-off-by: Dapeng Mi <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
[ Quanxian Wang: amend commit log ]
Signed-off-by: Quanxian Wang <[email protected]>
commit 4a3fd13054a98c43dfcfcbdb93deb43c7b1b9c34 upstream.

Arch-PEBS retires the IA32_PEBS_ENABLE and MSR_PEBS_DATA_CFG MSRs, so
intel_pmu_pebs_enable/disable() and intel_pmu_pebs_enable/disable_all()
do not need to be called for arch-PEBS.

To make the code cleaner, introduce the static calls
x86_pmu_pebs_enable/disable() and x86_pmu_pebs_enable/disable_all()
instead of adding an "x86_pmu.arch_pebs" check directly in these helpers.

Intel-SIG: commit 4a3fd13054a9 perf/x86/intel: Introduce pairs of PEBS static calls.
CWF PMU core backporting

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Dapeng Mi <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
[ Quanxian Wang: amend commit log ]
Signed-off-by: Quanxian Wang <[email protected]>
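
A sketch of the static-call pattern, following the existing x86_pmu
static-call conventions in events/core.c (the exact set of wired-up
helpers is per the commit text):

  DEFINE_STATIC_CALL_NULL(x86_pmu_pebs_enable, *x86_pmu.pebs_enable);
  DEFINE_STATIC_CALL_NULL(x86_pmu_pebs_disable, *x86_pmu.pebs_disable);

  /* Bound once during PMU init, like the other x86_pmu static calls: */
  static_call_update(x86_pmu_pebs_enable, x86_pmu.pebs_enable);
  static_call_update(x86_pmu_pebs_disable, x86_pmu.pebs_disable);

  /*
   * Call sites then use static_call(x86_pmu_pebs_enable)(event), and
   * the legacy-PEBS vs. arch-PEBS decision costs no runtime branch.
   */
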
…esctrl subsystem

commit 594902c986e269660302f09df9ec4bf1cf017b77 upstream.

In the resctrl subsystem's Sub-NUMA Cluster (SNC) mode, the rdt_mon_domain
structure representing a NUMA node relies on the cacheinfo interface
(rdt_mon_domain::ci) to store L3 cache information (e.g., shared_cpu_map)
for monitoring. The L3 cache information of a SNC NUMA node determines
which domains are summed for the "top level" L3-scoped events.

rdt_mon_domain::ci is initialized using the first online CPU of a NUMA
node. When this CPU goes offline, its shared_cpu_map is cleared to contain
only the offline CPU itself. Subsequently, attempting to read counters
via smp_call_on_cpu(offline_cpu) fails (and the error is ignored),
returning zero values for "top-level events" without any error indication.

Replace the cacheinfo references in struct rdt_mon_domain and struct
rmid_read with the cacheinfo ID (a unique identifier for the L3 cache).

rdt_domain_hdr::cpu_mask contains the online CPUs associated with that
domain. When reading "top-level events", select a CPU from
rdt_domain_hdr::cpu_mask and utilize its L3 shared_cpu_map to determine
valid CPUs for reading RMID counter via the MSR interface.

Considering all CPUs associated with the L3 cache improves the chances
of picking a housekeeping CPU on which the counter reading work can be
queued, avoiding an unnecessary IPI.

Fixes: 328ea68 ("x86/resctrl: Prepare for new Sub-NUMA Cluster (SNC) monitor files")
Signed-off-by: Qinyun Tan <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Tony Luck <[email protected]>
Link: https://lore.kernel.org/[email protected]
Signed-off-by: Kui Wen <[email protected]>
Add missing clockticks and cas_count_* events for Intel SapphireRapids IMC
PMU. These events are useful to measure memory bandwidth.

Signed-off-by: Stephane Eranian <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Kan Liang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
@bhe4 force-pushed the 5.15-velinux_all_without_57 branch from 521296b to 7182e58 on October 28, 2025 11:24
@bhe4 marked this pull request as draft on October 30, 2025 02:42
Dapeng Mi and others added 3 commits November 5, 2025 15:57
With the introduction of Perfmon v6, PMU counters can be
discontiguous; for example, among the fixed counters on CWF, only fixed
counters 0-3 and 5-7 are supported, and there is no fixed counter 4. To
accommodate this change, the archPerfmonExt CPUID (0x23) leaves were
introduced to enumerate the true view of the counter bitmap.

The current perf code already supports the archPerfmonExt CPUID and uses
the counter bitmap to enumerate the counters the HW really supports, but
x86_pmu_show_pmu_cap() still dumps only the absolute counter number
instead of the true-view bitmap; it is outdated and may mislead readers.

So dump the counters' true-view bitmap in x86_pmu_show_pmu_cap() and
opportunistically change the dump sequence and wording.

Signed-off-by: Dapeng Mi <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Kan Liang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Clearwater Forest has the same C-state residency counters as Sierra
Forest, so this simply adds the CPU model ID for it.

Cc: Artem Bityutskiy <[email protected]>
Cc: Kan Liang <[email protected]>
Reviewed-by: Kan Liang <[email protected]>
Signed-off-by: Zhenyu Wang <[email protected]>
1. add CONFIG_INTEL_IFS=m
2. add CONFIG_DMATEST=m
3. add CONFIG_TCG_TPM=y
4. do below change for EDAC on 25ww43.4
CONFIG_EDAC=y
CONFIG_EDAC_DEBUG=y
CONFIG_EDAC_DECODE_MCE=y
CONFIG_ACPI_APEI_ERST_DEBUG=y
CONFIG_EDAC_IEH=m
5. do below change for power module
CONFIG_INTEL_TPMI=m
CONFIG_INTEL_VSEC=m
CONFIG_INTEL_RAPL_TPMI=m
CONFIG_INTEL_PMT_CLASS=m
CONFIG_INTEL_PMT_TELEMETRY=m

Signed-off-by: Bo He <[email protected]>
@bhe4 force-pushed the 5.15-velinux_all_without_57 branch from 7182e58 to 755137d on November 5, 2025 07:58