Skip to content

Conversation

@PvsNarasimha
Copy link

@PvsNarasimha PvsNarasimha commented May 9, 2025

Hardware Feature Enablement / Support

These patches add support for new CPU features, CPUID leaves, and hardware capabilities on Intel and AMD platforms.

  • x86/cpu: Enable STIBP on AMD if Automatic IBRS is enabled
  • KVM: x86/cpuid: Add AMD CPUID ExtPerfMonAndDbg leaf 0x80000022
  • KVM: x86/svm/pmu: Add AMD PerfMonV2 support
  • x86/cpu: Support AMD Automatic IBRS
  • x86/cpu, kvm: Add the Null Selector Clears Base feature
  • KVM: x86: add support for CPUID leaf 0x80000021
  • KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS
  • perf/x86/core: Completely disable guest PEBS via guest's global_ctrl
  • KVM: x86/pmu: Disable guest PEBS temporarily in two rare situations
  • KVM: x86/pmu: Reprogram PEBS event to emulate guest PEBS counter
  • KVM: SVM: include CR3 in initial VMSA state for SEV-ES guests

Performance Improvements

Optimizations to reduce overhead and improve runtime efficiency:

  • KVM: x86/pmu: Rewrite reprogram_counters() to improve performance
  • KVM: x86: Use static calls to reduce kvm_pmu_ops overhead
  • KVM: x86: Copy kvm_pmu_ops by value to eliminate layer of indirection
  • KVM: x86/pmu: Use binary search to check filtered events
  • KVM: x86/pmu: Avoid using PEBS perf_events for normal counters

Refactoring / Code Reorganization

These changes focus on improving maintainability, readability, and structure of the KVM/x86 codebase:

  • KVM: x86/cpuid: Refactor host/guest CPU model consistency check
  • KVM: x86: Introduce __kvm_get_hypervisor_cpuid() helper
  • KVM: VMX: Refactor intel_pmu_{g,}set_msr() to align with other helpers
  • KVM: x86: Move open-coded CPUID leaf 0x80000021 EAX bit propagation code
  • KVM: nVMX: Refactor PMU refresh to avoid referencing kvm_x86_ops.pmu_ops
  • KVM: x86: Move guts of kvm_arch_init() to standalone helper
  • KVM: x86/pmu: Move handling PERF_GLOBAL_CTRL and friends to common x86
  • Move various helpers (e.g., pmc_perf_hw_id())
  • KVM: x86: Use more verbose names for mem encrypt kvm_x86_ops hooks

Feature Enhancements (vPMU / CPUID / MSRs / etc.)

These provide new capabilities, configurable options, and fine-tuned control to user space or the guest:

  • KVM: x86: Provide per VM capability for disabling PMU virtualization
  • KVM: x86/svm: Add module param to control PMU virtualization
  • KVM: x86/pmu: Restrict advanced features based on module enable_pmu
  • KVM: x86/pmu: Advertise PERFCTR_CORE iff the min nr of counters is met
  • KVM: x86/pmu: Expose CPUIDs feature bits PDCM, DS, DTES64
  • KVM: x86: Use actual kvm_cpuid.base for clearing KVM_FEATURE_PV_UNHALT
  • KVM: x86: Snapshot if a vCPU's vendor model is AMD vs. Intel compatible
  • KVM: x86: Move lookup of indexed CPUID leafs to helper

User/Guest Safety / Fault Isolation

These prevent invalid or insecure guest/host interactions:

  • KVM: x86/pmu: Zero out PMU metadata on AMD if PMU is disabled
  • KVM: x86/pmu: Reject userspace attempts to set reserved GLOBAL_STATUS bits
  • KVM: x86/pmu: Prevent the PMU from counting disallowed events
  • KVM: x86/pmu: Limit the maximum number of supported AMD GP counters
  • KVM: x86/pmu: Limit the maximum number of supported Intel GP counters
  • KVM: x86/pmu: WARN and bug the VM if PMU is refreshed after vCPU has run

Bug Fixes

These patches address correctness issues, warnings, and reliability concerns in KVM and vPMU:

  • KVM: x86: Fix errant brace in KVM capability handling
  • KVM: x86/pmu: Fix type length error when reading pmu->fixed_ctr_ctrl
  • KVM: x86/pmu: fix masking logic for MSR_CORE_PERF_GLOBAL_CTRL
  • KVM: x86/pmu: fix masking logic for MSR_CORE_PERF_GLOBAL_CTRL
  • KVM: x86: Fix pointer mistmatch warning when patching RET0 static calls
  • KVM: x86: Fix clang -Wimplicit-fallthrough in do_host_cpuid()
  • KVM: x86: avoid out of bounds indices for fixed performance counters
  • KVM: x86/pmu: Do not mask LVTPC when handling a PMI on AMD platforms
  • kvm: x86/pmu: Fix the compare function used by the pmu event filter

Documentation or Cleanup / Misc

Cleanup and changes that make the codebase more developer-friendly:

  • docs: kvm: x86: Fix broken field list
  • KVM: x86/pmu: Rename global_ovf_ctrl_mask to global_status_mask
  • KVM: x86/pmu: Rename pmc_is_enabled() to pmc_is_globally_enabled()
  • KVM: x86: Rename kvm_x86_ops pointers to align w/ preferred vendor names
  • KVM: x86: Remove defunct pre_block/post_block kvm_x86_ops hooks
  • KVM: x86: Move CPUID.(EAX=0x12,ECX=1) mangling to __kvm_update_cpuid_runtime()
  • KVM: x86: use static_call_cond for optional callbacks

Run Test cases

$ git clone https://gitlab.com/kvm-unit-tests/kvm-unit-tests.git
$ cd kvm-unit-tests/
$ ./configure
$ make
$ ./run_tests.sh
root@volcano9b5e-os:/home/amd/Linux_Backport/kvm-unit-tests# ./run_tests.sh
PASS apic-split (56 tests)
PASS ioapic-split (19 tests)
PASS x2apic (56 tests)
FAIL xapic (timeout; duration=60)
PASS ioapic (26 tests)
SKIP cmpxchg8b (i386 only)
PASS smptest (1 tests)
PASS smptest3 (1 tests)
PASS vmexit_cpuid
PASS vmexit_vmcall
PASS vmexit_mov_from_cr8
PASS vmexit_mov_to_cr8
PASS vmexit_inl_pmtimer
PASS vmexit_ipi
PASS vmexit_ipi_halt
PASS vmexit_ple_round_robin
PASS vmexit_tscdeadline
PASS vmexit_tscdeadline_immed
PASS vmexit_cr0_wp
PASS vmexit_cr4_pge
PASS access (2 tests)
SKIP access_fep (test marked as manual run only)
SKIP access-reduced-maxphyaddr (/sys/module/kvm_intel/parameters/allow_smaller_maxphyaddr not equal to Y)
PASS smap (18 tests)
PASS pku (7 tests)
SKIP pks (0 tests)
PASS asyncpf (2 tests, 1 skipped)
PASS emulator (140 tests, 2 skipped)
PASS eventinj (13 tests)
PASS hypercall (2 tests)
PASS idt_test (4 tests)
PASS memory (7 tests, 1 skipped)
PASS msr (1836 tests)
SKIP pmu (/proc/sys/kernel/nmi_watchdog not equal to 0)
SKIP pmu_lbr (/proc/sys/kernel/nmi_watchdog not equal to 0)
SKIP pmu_pebs (/proc/sys/kernel/nmi_watchdog not equal to 0)
SKIP vmware_backdoors (/sys/module/kvm/parameters/enable_vmware_backdoor not equal to Y)
PASS realmode
PASS s3
PASS setjmp (10 tests)
PASS sieve
PASS syscall (2 tests)
PASS tsc (6 tests)
PASS tsc_adjust (6 tests)
PASS xsave (17 tests)
PASS rmap_chain
FAIL svm
SKIP svm_pause_filter (1 tests, 1 skipped)
PASS svm_npt (103 tests)
SKIP taskswitch (i386 only)
SKIP taskswitch2 (i386 only)
PASS kvmclock_test
PASS pcid-enabled (2 tests)
PASS pcid-disabled (2 tests)
PASS pcid-asymmetric (2 tests)
PASS rdpru (1 tests)
PASS umip (21 tests)
SKIP la57 (i386 only)
SKIP vmx (0 tests)
SKIP ept (0 tests)
SKIP vmx_eoi_bitmap_ioapic_scan (0 tests)
SKIP vmx_hlt_with_rvi_test (0 tests)
SKIP vmx_apicv_test (0 tests)
SKIP vmx_posted_intr_test (0 tests)
SKIP vmx_apic_passthrough_thread (0 tests)
SKIP vmx_init_signal_test (0 tests)
SKIP vmx_sipi_signal_test (0 tests)
SKIP vmx_apic_passthrough_tpr_threshold_test (0 tests)
SKIP vmx_vmcs_shadow_test (0 tests)
SKIP vmx_pf_exception_test (0 tests)
SKIP vmx_pf_exception_test_fep (test marked as manual run only)
SKIP vmx_pf_vpid_test (test marked as manual run only)
SKIP vmx_pf_invvpid_test (test marked as manual run only)
SKIP vmx_pf_no_vpid_test (test marked as manual run only)
SKIP vmx_pf_exception_test_reduced_maxphyaddr (/sys/module/kvm_intel/parameters/allow_smaller_maxphyaddr not equal to Y)
PASS debug (23 tests)
PASS hyperv_synic (1 tests)
PASS hyperv_connections (7 tests)
PASS hyperv_stimer (12 tests)
PASS hyperv_stimer_direct (8 tests)
PASS hyperv_clock (3 tests)
PASS intel_iommu (11 tests)
SKIP tsx-ctrl (1 tests, 1 skipped)
SKIP intel_cet (0 tests)
  • Verified via debug logs
root@volcano9dee-host:/home/amd# dmesg | grep Malathi 
[  29.605656] amd_pmu_vs_enable_all is called : Malathi 
[  29.607491] amd_pmu_ _v2_disable_all is called : Malathi
[  29.618834] amd_pmu_vs_disable_all is called : Malathi 
[  29.619491] amd_pmu_v2_enable_all is called : Malathi 
[  29.619530] amd_pmu_vs_disable_all is called : Malathi 
[  29.620491] amd_pmu_v2_enable_all is called : Malathi 
[  29.620493] amd_pmu_vs_disable_all is called : Malathi 
[  29.621491] amd_pmu_v2_enable_all is called : Malathi 
[  29.621491] amd_pmu_vs_disable_all is called : Malathi 
[  29.621491] amd_pmu_v2_enable_all is called : Malathi 
[  29.624441] amd_pmu_vs_disable_all is called : Malathi
  • Validated PMU with perf
root@volcano9dee-host:/home/amd/upstream_work/original_kernel_velinux/kernel/tools/perf# ./perf stat -e cycles,instructions a.out sleep 3
Hello,World
 
Performance counter stats for 'a.out sleep 3':
 
          5,74,208      cycles
          6,08,549      instructions              #    1.06  insn per cycle
 
       0.008310204 seconds time elapsed
 
       0.000000000 seconds user
       0.000797000 seconds sys
root@volcano9dee-host:/home/amd/upstream_work/original_kernel_velinux/kernel/tools/perf# ./perf record
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.273 MB perf.data (1560 samples) ]
./perf report | grep pmu
     0.11%  swapper          [kernel.kallsyms]         [k] x86_pmu_disable
     0.02%  perf             [kernel.kallsyms]         [k] amd_pmu_v2_enable_all
  • Booted VM with QEMU
vi vmc_script.sh //Copy the below script

#!/bin/bash

qemu-system-x86_64 \
 -m 2048 \
 -enable-kvm \
 -cpu host \
 -smp 2 \
 -hda /vms/images/velinux_chaithu.qcow2 \
 -cdrom /vms/iso/velinux-2.1-amd64-DVD-1.iso \
 -boot d \
 -netdev user,id=net0,hostfwd=tcp::2222-:22 \
 -device virtio-net-pci,netdev=net0 \
 -vnc :0

Run the script to create a VM

chmod +x vmc_script.sh
./vmc_script.sh

sandip4n and others added 30 commits May 9, 2025 15:32
commit 49ff3b4 upstream.

On AMD and Hygon platforms, the local APIC does not automatically set
the mask bit of the LVTPC register when handling a PMI and there is
no need to clear it in the kernel's PMI handler.

For guests, the mask bit is currently set by kvm_apic_local_deliver()
and unless it is cleared by the guest kernel's PMI handler, PMIs stop
arriving and break use-cases like sampling with perf record.

This does not affect non-PerfMonV2 guests because PMIs are handled in
the guest kernel by x86_pmu_handle_irq() which always clears the LVTPC
mask bit irrespective of the vendor.

Before:

  $ perf record -e cycles:u true
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.001 MB perf.data (1 samples) ]

After:

  $ perf record -e cycles:u true
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.002 MB perf.data (19 samples) ]

Fixes: a16eb25 ("KVM: x86: Mask LVTPC when handling a PMI")
Cc: [email protected]
Signed-off-by: Sandipan Das <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
[sean: use is_intel_compatible instead of !is_amd_or_hygon()]
Signed-off-by: Sean Christopherson <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Arukonda Rahul <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit fd706c9 upstream.

Add kvm_vcpu_arch.is_amd_compatible to cache if a vCPU's vendor model is
compatible with AMD, i.e. if the vCPU vendor is AMD or Hygon, along with
helpers to check if a vCPU is compatible AMD vs. Intel.  To handle Intel
vs. AMD behavior related to masking the LVTPC entry, KVM will need to
check for vendor compatibility on every PMI injection, i.e. querying for
AMD will soon be a moderately hot path.

Note!  This subtly (or maybe not-so-subtly) makes "Intel compatible" KVM's
default behavior, both if userspace omits (or never sets) CPUID 0x0 and if
userspace sets a completely unknown vendor.  One could argue that KVM
should treat such vCPUs as not being compatible with Intel *or* AMD, but
that would add useless complexity to KVM.

KVM needs to do *something* in the face of vendor specific behavior, and
so unless KVM conjured up a magic third option, choosing to treat unknown
vendors as neither Intel nor AMD means that checks on AMD compatibility
would yield Intel behavior, and checks for Intel compatibility would yield
AMD behavior.  And that's far worse as it would effectively yield random
behavior depending on whether KVM checked for AMD vs. Intel vs. !AMD vs.
!Intel.  And practically speaking, all x86 CPUs follow either Intel or AMD
architecture, i.e. "supporting" an unknown third architecture adds no
value.

Deliberately don't convert any of the existing guest_cpuid_is_intel()
checks, as the Intel side of things is messier due to some flows explicitly
checking for exactly vendor==Intel, versus some flows assuming anything
that isn't "AMD compatible" gets Intel behavior.  The Intel code will be
cleaned up in the future.

Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Arukonda Rahul <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 25b9784 upstream.

Manually look for a CPUID.0x1 entry instead of bouncing through
kvm_cpuid() when retrieving the Family-Model-Stepping information for
vCPU RESET/INIT.  This fixes a potential undefined behavior bug due to
kvm_cpuid() using the uninitialized "dummy" param as the ECX _input_,
a.k.a. the index.

A more minimal fix would be to simply zero "dummy", but the extra work in
kvm_cpuid() is wasteful, and KVM should be treating the FMS retrieval as
an out-of-band access, e.g. same as how KVM computes guest.MAXPHYADDR.
Both Intel's SDM and AMD's APM describe the RDX value at RESET/INIT as
holding the CPU's FMS information, not as holding CPUID.0x1.EAX.  KVM's
usage of CPUID entries to get FMS is simply a pragmatic approach to avoid
having yet another way for userspace to provide inconsistent data.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Arukonda Rahul <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 540c7ab upstream.

SDM section 18.2.3 mentioned that:

  "IA32_PERF_GLOBAL_OVF_CTL MSR allows software to clear overflow indicator(s) of
   any general-purpose or fixed-function counters via a single WRMSR."

It is R/W mentioned by SDM, we read this msr on bare-metal during perf testing,
the value is always 0 for ICX/SKX boxes on hands. Let's fill get_msr
MSR_CORE_PERF_GLOBAL_OVF_CTRL w/ 0 as hardware behavior and drop
global_ovf_ctrl variable.

Tested-by: Like Xu <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
…able

commit 73cd107 upstream.

Use the generic kvm_running_vcpu plus a new 'handling_intr_from_guest'
variable in kvm_arch_vcpu instead of the semi-redundant current_vcpu.
kvm_before/after_interrupt() must be called while the vCPU is loaded,
(which protects against preemption), thus kvm_running_vcpu is guaranteed
to be non-NULL when handling_intr_from_guest is non-zero.

Switching to kvm_get_running_vcpu() will allows moving KVM's perf
callbacks to generic code, and the new flag will be used in a future
patch to more precisely identify the "NMI from guest" case.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: PvsNarasimha <[email protected]>
commit e1bfc24 upstream.

Move x86's perf guest callbacks into common KVM, as they are semantically
identical to arm64's callbacks (the only other such KVM callbacks).
arm64 will convert to the common versions in a future patch.

Implement the necessary arm64 arch hooks now to avoid having to provide
stubs or a temporary #define (from x86) to avoid arm64 compilation errors
when CONFIG_GUEST_PERF_EVENTS=y.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: PvsNarasimha <[email protected]>
commit b1d66da upstream.

For Intel, the guest PMU can be disabled via clearing the PMU CPUID.
For AMD, all hw implementations support the base set of four
performance counters, with current mainstream hardware indicating
the presence of two additional counters via X86_FEATURE_PERFCTR_CORE.

In the virtualized world, the AMD guest driver may detect
the presence of at least one counter MSR. Most hypervisor
vendors would introduce a module param (like lbrv for svm)
to disable PMU for all guests.

Another control proposal per-VM is to pass PMU disable information
via MSR_IA32_PERF_CAPABILITIES or one bit in CPUID Fn4000_00[FF:00].
Both of methods require some guest-side changes, so a module
parameter may not be sufficiently granular, but practical enough.

Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 006a0f0 upstream.

Because IceLake has 4 fixed performance counters but KVM only
supports 3, it is possible for reprogram_fixed_counters to pass
to reprogram_fixed_counter an index that is out of bounds for the
fixed_pmc_events array.

Ultimately intel_find_fixed_event, which is the only place that uses
fixed_pmc_events, handles this correctly because it checks against the
size of fixed_pmc_events anyway.  Every other place operates on the
fixed_counters[] array which is sized according to INTEL_PMC_MAX_FIXED.
However, it is cleaner if the unsupported performance counters are culled
early on in reprogram_fixed_counters.

Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 7618756 upstream.

The current pmc->eventsel for fixed counter is underutilised. The
pmc->eventsel can be setup for all known available fixed counters
since we have mapping between fixed pmc index and
the intel_arch_events array.

Either gp or fixed counter, it will simplify the later checks for
consistency between eventsel and perf_hw_id.

Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 6ed1298 upstream.

Since we set the same semantic event value for the fixed counter in
pmc->eventsel, returning the perf_hw_id for the fixed counter via
find_fixed_event() can be painlessly replaced by pmc_perf_hw_id()
with the help of pmc_is_fixed() check.

Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 40ccb96 upstream.

Depending on whether intr should be triggered or not, KVM registers
two different event overflow callbacks in the perf_event context.

The code skeleton of these two functions is very similar, so
the pmc->intr can be stored into pmc from pmc_reprogram_counter()
which provides smaller instructions footprint against the
u-architecture branch predictor.

The __kvm_perf_overflow() can be called in non-nmi contexts
and a flag is needed to distinguish the caller context and thus
avoid a check on kvm_is_in_guest(), otherwise we might get
warnings from suspicious RCU or check_preemption_disabled().

[Backport Changes]
- In commit b9f5621, kvm_is_in_guest() was changed
  to kvm_guest_state().
- In commit 73cd107, kvm_guest_state() was updated
  to kvm_handling_nmi_from_guest().
- In commit 40ccb96, kvm_is_in_guest() was removed,
  but instead of removing kvm_handling_nmi_from_guest(pmc->vcpu)
  was retained for compatibility
- This backported patch adds kvm_handling_nmi_from_guest(pmc->vcpu)
  instead of kvm_is_in_guest() for compatibility.

Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 9cd803d upstream.

When KVM retires a guest instruction through emulation, increment any
vPMCs that are configured to monitor "instructions retired," and
update the sample period of those counters so that they will overflow
at the right time.

Signed-off-by: Eric Hankland <[email protected]>
[jmattson:
  - Split the code to increment "branch instructions retired" into a
    separate commit.
  - Added 'static' to kvm_pmu_incr_counter() definition.
  - Modified kvm_pmu_incr_counter() to check pmc->perf_event->state ==
    PERF_EVENT_STATE_ACTIVE.
]
Fixes: f5132b0 ("KVM: Expose a version 2 architectural PMU to a guests")
Signed-off-by: Jim Mattson <[email protected]>
[likexu:
  - Drop checks for pmc->perf_event or event state or event type
  - Increase a counter once its umask bits and the first 8 select bits are matched
  - Rewrite kvm_pmu_incr_counter() with a less invasive approach to the host perf;
  - Rename kvm_pmu_record_event to kvm_pmu_trigger_event;
  - Add counter enable and CPL check for kvm_pmu_trigger_event();
]
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 405329f upstream.

Normally guests will set up CR3 themselves, but some guests, such as
kselftests, and potentially CONFIG_PVH guests, rely on being booted
with paging enabled and CR3 initialized to a pre-allocated page table.

Currently CR3 updates via KVM_SET_SREGS* are not loaded into the guest
VMCB until just prior to entering the guest. For SEV-ES/SEV-SNP, this
is too late, since it will have switched over to using the VMSA page
prior to that point, with the VMSA CR3 copied from the VMCB initial
CR3 value: 0.

Address this by sync'ing the CR3 value into the VMCB save area
immediately when KVM_SET_SREGS* is issued so it will find it's way into
the initial VMSA.

Suggested-by: Tom Lendacky <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
Message-Id: <[email protected]>
[Remove vmx_post_set_cr3; add a remark about kvm_set_cr3 not calling the
 new hook. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
…tries

commit ee3a5f9 upstream.

kvm_update_cpuid_runtime() mangles CPUID data coming from userspace
VMM after updating 'vcpu->arch.cpuid_entries', this makes it
impossible to compare an update with what was previously
supplied. Introduce __kvm_update_cpuid_runtime() version which can be
used to tweak the input before it goes to 'vcpu->arch.cpuid_entries'
so the upcoming update check can compare tweaked data.

No functional change intended.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: chaithanyaLagisetty <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 4732f24 upstream.

The new module parameter to control PMU virtualization should apply
to Intel as well as AMD, for situations where userspace is not trusted.
If the module parameter allows PMU virtualization, there could be a
new KVM_CAP or guest CPUID bits whereby userspace can enable/disable
PMU virtualization on a per-VM basis.

If the module parameter does not allow PMU virtualization, there
should be no userspace override, since we have no precedent for
authorizing that kind of override. If it's false, other counter-based
profiling features (such as LBR including the associated CPUID bits
if any) will not be exposed.

Change its name from "pmu" to "enable_pmu" as we have temporary
variables with the same name in our code like "struct kvm_pmu *pmu".

Fixes: b1d66da ("KVM: x86/svm: Add module param to control PMU virtualization")
Suggested-by : Jim Mattson <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 7ff775a upstream.

The PMU event filter may contain up to 300 events. Replace the linear
search in reprogram_gp_counter() with a binary search.

Signed-off-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit c3e8abf upstream.

Drop kvm_x86_ops' pre/post_block() now that all implementations are nops.

No functional change intended.

[Backport Changes]
Definitions of pi_{pre, post}_block() were removed in the commit: d76fb40

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
…runtime()

commit 5c89be1 upstream.

Full equality check of CPUID data on update (kvm_cpuid_check_equal()) may
fail for SGX enabled CPUs as CPUID.(EAX=0x12,ECX=1) is currently being
mangled in kvm_vcpu_after_set_cpuid(). Move it to
__kvm_update_cpuid_runtime() and split off cpuid_get_supported_xcr0()
helper  as 'vcpu->arch.guest_supported_xcr0' update needs (logically)
to stay in kvm_vcpu_after_set_cpuid().

Cc: [email protected]
Fixes: feb627e ("KVM: x86: Forbid KVM_SET_CPUID{,2} after KVM_RUN")
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Arukonda Rahul <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 2746a6b upstream.

Hypervisor leaves are always synthesized by __do_cpuid_func; just return
zeroes and do not ask the host.  Even on nested virtualization, a value
from another hypervisor would be bogus, because all hypercalls and MSRs
are processed by KVM.

Reviewed-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Arukonda Rahul <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit feee3d9 upstream.

Remove the export of kvm_x86_tlb_flush_current() as there are no longer
any users outside of common x86 code.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit e27bc04 upstream.

Rename a variety of kvm_x86_op function pointers so that preferred name
for vendor implementations follows the pattern <vendor>_<function>, e.g.
rename .run() to .vcpu_run() to match {svm,vmx}_vcpu_run().  This will
allow vendor implementations to be wired up via the KVM_X86_OP macro.

In many cases, VMX and SVM "disagree" on the preferred name, though in
reality it's VMX and x86 that disagree as SVM blindly prepended _svm to
the kvm_x86_ops name.  Justification for using the VMX nomenclature:

  - set_{irq,nmi} => inject_{irq,nmi} because the helper is injecting an
    event that has already been "set" in e.g. the vIRR.  SVM's relevant
    VMCB field is even named event_inj, and KVM's stat is irq_injections.

  - prepare_guest_switch => prepare_switch_to_guest because the former is
    ambiguous, e.g. it could mean switching between multiple guests,
    switching from the guest to host, etc...

  - update_pi_irte => pi_update_irte to allow for matching match the rest
    of VMX's posted interrupt naming scheme, which is vmx_pi_<blah>().

  - start_assignment => pi_start_assignment to again follow VMX's posted
    interrupt naming scheme, and to provide context for what bit of code
    might care about an otherwise undescribed "assignment".

The "tlb_flush" => "flush_tlb" creates an inconsistency with respect to
Hyper-V's "tlb_remote_flush" hooks, but Hyper-V really is the one that's
wrong.  x86, VMX, and SVM all use flush_tlb, and even common KVM is on a
variant of the bandwagon with "kvm_flush_remote_tlbs", e.g. a more
appropriate name for the Hyper-V hooks would be flush_remote_tlbs.  Leave
that change for another time as the Hyper-V hooks always start as NULL,
i.e. the name doesn't matter for using kvm-x86-ops.h, and changing all
names requires an astounding amount of churn.

VMX and SVM function names are intentionally left as is to minimize the
diff.  Both VMX and SVM will need to rename even more functions in order
to fully utilize KVM_X86_OPS, i.e. an additional patch for each is
inevitable.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 0bcd556 upstream.

Refactor the nested VMX PMU refresh helper to pass it a flag stating
whether or not the vCPU has PERF_GLOBAL_CTRL instead of having the nVMX
helper query the information by bouncing through kvm_x86_ops.pmu_ops.
This will allow a future patch to use static_call() for the PMU ops
without having to export any static call definitions from common x86, and
it is also a step toward unexported kvm_x86_ops.

Alternatively, nVMX could call kvm_pmu_is_valid_msr() to indirectly use
kvm_x86_ops.pmu_ops, but that would incur an extra layer of indirection
and would require exporting kvm_pmu_is_valid_msr().

Opportunistically rename the helper to keep line lengths somewhat
reasonable, and to better capture its high-level role.

No functional change intended.

Cc: Like Xu <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 03d004c upstream.

Use slightly more verbose names for the so called "memory encrypt",
a.k.a. "mem enc", kvm_x86_ops hooks to bridge the gap between the current
super short kvm_x86_ops names and SVM's more verbose, but non-conforming
names.  This is a step toward using kvm-x86-ops.h with KVM_X86_CVM_OP()
to fill svm_x86_ops.

Opportunistically rename mem_enc_op() to mem_enc_ioctl() to better
reflect its true nature, as it really is a full fledged ioctl() of its
own.  Ideally, the hook would be named confidential_vm_ioctl() or so, as
the ioctl() is a gateway to more than just memory encryption, and because
its underlying purpose to support Confidential VMs, which can be provided
without memory encryption, e.g. if the TCB of the guest includes the host
kernel but not host userspace, or by isolation in hardware without
encrypting memory.  But, diverging from KVM_MEMORY_ENCRYPT_OP even
further is undeseriable, and short of creating alises for all related
ioctl()s, which introduces a different flavor of divergence, KVM is stuck
with the nomenclature.

Defer renaming SVM's functions to a future commit as there are additional
changes needed to make SVM fully conforming and to match reality (looking
at you, svm_vm_copy_asid_from()).

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 8a28978 upstream.

The two ioctls used to implement userspace-accelerated TPR,
KVM_TPR_ACCESS_REPORTING and KVM_SET_VAPIC_ADDR, are available
even if hardware-accelerated TPR can be used.  So there is
no reason not to report KVM_CAP_VAPIC.

[Backport changes]
- In commit 58fccda, report_flexpriority() is renamed
to vmx_cpu_has_accelerated_tpr().

Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 2a89061 upstream.

SVM implements neither update_emulated_instruction nor
set_apic_access_page_addr.  Remove an "if" by calling them
with static_call_cond().

Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit e4fc23b upstream.

The original use of KVM_X86_OP_NULL, which was to mark calls
that do not follow a specific naming convention, is not in use
anymore.  Instead, let's mark calls that are optional because
they are always invoked within conditionals or with static_call_cond.
Those that are _not_, i.e. those that are defined with KVM_X86_OP,
must be defined by both vendor modules or some kind of NULL pointer
dereference is bound to happen at runtime.

[Backport Changes]

Replace KVM_X86_OP_NULL with KVM_X86_OP_OPTIONAL for guest_memory_reclaimed() API
ensuring better alignment with upstream. changes.

Notably, APIs such as vm_copy_enc_context_from() and
vm_move_enc_context_from() are not part of our kernel, so they are excluded
from this change.

The backport commit f349144 uses KVM_X86_OP_NULL in the
vcpu_precreate() function, whereas the upstream. commit
d588bb9 has updated vcpu_precreate() to use KVM_X86_OP_OPTIONAL_RET0
instead, which is consistent with this change.

This update ensures consistency with the upstream. implementation and eliminates
legacy null operations.

Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit dd2319c upstream.

Use the newly corrected KVM_X86_OP annotations to warn about possible
NULL pointer dereferences as soon as the vendor module is loaded.

Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 5be2226 upstream.

A few vendor callbacks are only used by VMX, but they return an integer
or bool value.  Introduce KVM_X86_OP_OPTIONAL_RET0 for them: if a func is
NULL in struct kvm_x86_ops, it will be changed to __static_call_return0
when updating static calls.

[Backport changes]
In this commit f0f101b in file of "kernel/static_call.c"
added the EXPORT_SYMBOL_GPL(__static_call_return0);

Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 9250887 upstream.

Cast kvm_x86_ops.func to 'void *' when updating KVM static calls that are
conditionally patched to __static_call_return0().  clang complains about
using mismatching pointers in the ternary operator, which breaks the
build when compiling with CONFIG_KVM_WERROR=y.

  >> arch/x86/include/asm/kvm-x86-ops.h:82:1: warning: pointer type mismatch
  ('bool (*)(struct kvm_vcpu *)' and 'void *') [-Wpointer-type-mismatch]

Fixes: 5be2226 ("KVM: x86: allow defining return-0 static calls")
Reported-by: Like Xu <[email protected]>
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: David Dunn <[email protected]>
Reviewed-by: Nathan Chancellor <[email protected]>
Tested-by: Nathan Chancellor <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 58b3d12 upstream.

CPUID leaf 0x80000021 defines some features (or lack of bugs) of AMD
processors.  Expose the ones that make sense via KVM_GET_SUPPORTED_CPUID.

Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Arukonda Rahul <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
Like Xu and others added 28 commits May 9, 2025 15:32
commit 8de1854 upstream.

Move reprogram_counters() out of Intel specific PMU code and into pmu.h so
that it can be used to implement AMD PMU v2 support.

No functional change intended.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Like Xu <[email protected]>
[sean: rewrite changelog]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
… bits

commit 30dab5c upstream.

Reject userspace writes to MSR_CORE_PERF_GLOBAL_STATUS that attempt to set
reserved bits.  Allowing userspace to stuff reserved bits doesn't harm KVM
itself, but it's architecturally wrong and the guest can't clear the
unsupported bits, e.g. makes the guest's PMI handler very confused.

Signed-off-by: Like Xu <[email protected]>
[sean: rewrite changelog to avoid use of #GP, rebase on name change]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: chaithanyaLagisetty <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit c85cdc1 upstream.

Move the handling of GLOBAL_CTRL, GLOBAL_STATUS, and GLOBAL_OVF_CTRL,
a.k.a. GLOBAL_STATUS_RESET, from Intel PMU code to generic x86 PMU code.
AMD PerfMonV2 defines three registers that have the same semantics as
Intel's variants, just with different names and indices.  Conveniently,
since KVM virtualizes GLOBAL_CTRL on Intel only for PMU v2 and above, and
AMD's version shows up in v2, KVM can use common code for the existence
check as well.

[Backport changes]
This change removes the condition that returns the value of pmu->version > 1
from the file `arch/x86/kvm/vmx/pmu_intel.c`, which was included
in upstream commit b663f0b.

Signed-off-by: Like Xu <[email protected]>
Co-developed-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 13afa29 upstream.

Move the Intel PMU implementation of pmc_is_enabled() to common x86 code
as pmc_is_globally_enabled(), and drop AMD's implementation.  AMD PMU
currently supports only v1, and thus not PERF_GLOBAL_CONTROL, thus the
semantics for AMD are unchanged.  And when support for AMD PMU v2 comes
along, the common behavior will also Just Work.

Signed-off-by: Like Xu <[email protected]>
Co-developed-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: chaithanyaLagisetty <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 6593039 upstream.

Add an explicit !enable_pmu check as relying on kvm_pmu_cap to be
zeroed isn't obvious. Although when !enable_pmu, KVM will have
zero-padded kvm_pmu_cap to do subsequent CPUID leaf assignments.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: chaithanyaLagisetty <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 6a08083 upstream.

Disable PMU support when running on AMD and perf reports fewer than four
general purpose counters. All AMD PMUs must define at least four counters
due to AMD's legacy architecture hardcoding the number of counters
without providing a way to enumerate the number of counters to software,
e.g. from AMD's APM:

 The legacy architecture defines four performance counters (PerfCtrn)
 and corresponding event-select registers (PerfEvtSeln).

Virtualizing fewer than four counters can lead to guest instability as
software expects four counters to be available. Rather than bleed AMD
details into the common code, just define a const unsigned int and
provide a convenient location to document why Intel and AMD have different
mins (in particular, AMD's lack of any way to enumerate less than four
counters to the guest).

Keep the minimum number of counters at Intel at one, even though old P6
and Core Solo/Duo processor effectively require a minimum of two counters.
KVM can, and more importantly has up until this point, supported a vPMU so
long as the CPU has at least one counter.  Perf's support for P6/Core CPUs
does require two counters, but perf will happily chug along with a single
counter when running on a modern CPU.

[Backport changes]
Adjusted tab space to align with upstream. commit style.
No functional change was made to the code in this section.

Cc: Jim Mattson <[email protected]>
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Like Xu <[email protected]>
[sean: set Intel min to '1', not '2']
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: chaithanyaLagisetty <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit d338d87 upstream.

Enable and advertise PERFCTR_CORE if and only if the minimum number of
required counters are available, i.e. if perf says there are less than six
general purpose counters.

Opportunistically, use kvm_cpu_cap_check_and_set() instead of open coding
the check for host support.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Like Xu <[email protected]>
[sean: massage shortlog and changelog]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: chaithanyaLagisetty <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 1c2bf8a upstream.

Cap the number of general purpose counters enumerated on AMD to what KVM
actually supports, i.e. don't allow userspace to coerce KVM into thinking
there are more counters than actually exist, e.g. by enumerating
X86_FEATURE_PERFCTR_CORE in guest CPUID when its not supported.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Like Xu <[email protected]>
[sean: massage changelog]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: chaithanyaLagisetty <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit fe8d76c upstream.

Add a KVM-only leaf for AMD's PerfMonV2 to redirect the kernel's scattered
version to its architectural location, e.g. so that KVM can query guest
support via guest_cpuid_has().

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Like Xu <[email protected]>
[sean: massage changelog]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: chaithanyaLagisetty <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 4a27718 upstream.

If AMD Performance Monitoring Version 2 (PerfMonV2) is detected by
the guest, it can use a new scheme to manage the Core PMCs using the
new global control and status registers.

In addition to benefiting from the PerfMonV2 functionality in the same
way as the host (higher precision), the guest also can reduce the number
of vm-exits by lowering the total number of MSRs accesses.

In terms of implementation details, amd_is_valid_msr() is resurrected
since three newly added MSRs could not be mapped to one vPMC.
The possibility of emulating PerfMonV2 on the mainframe has also
been eliminated for reasons of precision.

Co-developed-by: Sandipan Das <[email protected]>
Signed-off-by: Sandipan Das <[email protected]>
Signed-off-by: Like Xu <[email protected]>
[sean: drop "Based on the observed HW." comments]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 94cdeeb upstream.

CPUID leaf 0x80000022 i.e. ExtPerfMonAndDbg advertises some new
performance monitoring features for AMD processors.

Bit 0 of EAX indicates support for Performance Monitoring Version 2
(PerfMonV2) features. If found to be set during PMU initialization,
the EBX bits of the same CPUID function can be used to determine
the number of available PMCs for different PMU types.

Expose the relevant bits via KVM_GET_SUPPORTED_CPUID so that
guests can make use of the PerfMonV2 features.

Co-developed-by: Sandipan Das <[email protected]>
Signed-off-by: Sandipan Das <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit fd470a8 upstream.

Unlike Intel's Enhanced IBRS feature, AMD's Automatic IBRS does not
provide protection to processes running at CPL3/user mode, see section
"Extended Feature Enable Register (EFER)" in the APM v2 at
https://bugzilla.kernel.org/attachment.cgi?id=304652

Explicitly enable STIBP to protect against cross-thread CPL3
branch target injections on systems with Automatic IBRS enabled.

Also update the relevant documentation.

Fixes: e7862ed ("x86/cpu: Support AMD Automatic IBRS")
Reported-by: Tom Lendacky <[email protected]>
Signed-off-by: Kim Phillips <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: PvsNarasimha <[email protected]>
commit 3f2739b upstream.

Temporarily acquire kvm->srcu for read when potentially emulating WRMSR in
the VM-Exit fastpath handler, as several of the common helpers used during
emulation expect the caller to provide SRCU protection.  E.g. if the guest
is counting instructions retired, KVM will query the PMU event filter when
stepping over the WRMSR.

  dump_stack+0x85/0xdf
  lockdep_rcu_suspicious+0x109/0x120
  pmc_event_is_allowed+0x165/0x170
  kvm_pmu_trigger_event+0xa5/0x190
  handle_fastpath_set_msr_irqoff+0xca/0x1e0
  svm_vcpu_run+0x5c3/0x7b0 [kvm_amd]
  vcpu_enter_guest+0x2108/0x2580

Alternatively, check_pmu_event_filter() could acquire kvm->srcu, but this
isn't the first bug of this nature, e.g. see commit 5c30e81 ("KVM:
SVM: Skip WRMSR fastpath on VM-Exit if next RIP isn't valid").  Providing
protection for the entirety of WRMSR emulation will allow reverting the
aforementioned commit, and will avoid having to play whack-a-mole when new
uses of SRCU-protected structures are inevitably added in common emulation
helpers.

[Backport changes]
Retain old srcu_read_lock/unlock() for compatibility due to upstream conflict

Upstream commit 2031f28 renames srcu_read_lock/unlock() to
kvm_vcpu_srcu_read_lock/unlock(). To avoid conflicts, the old implementation
is retained for compatibility until the issue is resolved.

Fixes: dfdeda6 ("KVM: x86/pmu: Prevent the PMU from counting disallowed events")
Reported-by: Greg Thelen <[email protected]>
Reported-by: Aaron Lewis <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit b29a2ac upstream.

Performance counters are defined to have width less than 64 bits.  The
vPMU code maintains the counters in u64 variables but assumes the value
to fit within the defined width.  However, for Intel non-full-width
counters (MSR_IA32_PERFCTRx) the value receieved from the guest is
truncated to 32 bits and then sign-extended to full 64 bits.  If a
negative value is set, it's sign-extended to 64 bits, but then in
kvm_pmu_incr_counter() it's incremented, truncated, and compared to the
previous value for overflow detection.

That previous value is not truncated, so it always evaluates bigger than
the truncated new one, and a PMI is injected.  If the PMI handler writes
a negative counter value itself, the vCPU never quits the PMI loop.

Turns out that Linux PMI handler actually does write the counter with
the value just read with RDPMC, so when no full-width support is exposed
via MSR_IA32_PERF_CAPABILITIES, and the guest initializes the counter to
a negative value, it locks up.

This has been observed in the field, for example, when the guest configures
atop to use perfevents and runs two instances of it simultaneously.

To address the problem, maintain the invariant that the counter value
always fits in the defined bit width, by truncating the received value
in the respective set_msr methods.  For better readability, factor the
out into a helper function, pmc_write_counter(), shared by vmx and svm
parts.

Fixes: 9cd803d ("KVM: x86: Update vPMCs when retiring instructions")
Cc: [email protected]
Signed-off-by: Roman Kagan <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
Tested-by: Like Xu <[email protected]>
[sean: tweak changelog, s/set/write in the helper]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
…mode

commit 547c919 upstream.

When querying whether or not a vCPU "is" running in kernel mode, directly
get the CPL if the vCPU is the currently loaded vCPU.  In scenarios where
a guest is profiled via perf-kvm, querying vcpu->arch.preempted_in_kernel
from kvm_guest_state() is wrong if vCPU is actively running, i.e. isn't
scheduled out due to being preempted and so preempted_in_kernel is stale.

This affects perf/core's ability to accurately tag guest RIP with
PERF_RECORD_MISC_GUEST_{KERNEL|USER} and record it in the sample.  This
causes perf/tool to fail to connect the vCPU RIPs to the guest kernel
space symbols when parsing these samples due to incorrect PERF_RECORD_MISC
flags:

   Before (perf-report of a cpu-cycles sample):
      1.23%  :58945   [unknown]         [u] 0xffffffff818012e0

   After:
      1.35%  :60703   [kernel.vmlinux]  [g] asm_exc_page_fault

Note, checking preempted_in_kernel in kvm_arch_vcpu_in_kernel() is awful
as nothing in the API's suggests that it's safe to use if and only if the
vCPU was preempted.  That can be cleaned up in the future, for now just
fix the glaring correctness bug.

Note openvelinux#2, checking vcpu->preempted is NOT safe, as getting the CPL on VMX
requires VMREAD, i.e. is correct if and only if the vCPU is loaded.  If
the target vCPU *was* preempted, then it can be scheduled back in after
the check on vcpu->preempted in kvm_vcpu_on_spin(), i.e. KVM could end up
trying to do VMREAD on a VMCS that isn't loaded on the current pCPU.

Signed-off-by: Like Xu <[email protected]>
Fixes: e1bfc24 ("KVM: Move x86's perf guest info callbacks to generic KVM")
Link: https://lore.kernel.org/r/[email protected]
[sean: massage changelong, add Fixes]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 9710794 upstream.

When commit c59a1f1 ("KVM: x86/pmu: Add IA32_PEBS_ENABLE
MSR emulation for extended PEBS") switched the initialization of
cpuc->guest_switch_msrs to use compound literals, it screwed up
the boolean logic:

+	u64 pebs_mask = cpuc->pebs_enabled & x86_pmu.pebs_capable;
...
-	arr[0].guest = intel_ctrl & ~cpuc->intel_ctrl_host_mask;
-	arr[0].guest &= ~(cpuc->pebs_enabled & x86_pmu.pebs_capable);
+               .guest = intel_ctrl & (~cpuc->intel_ctrl_host_mask | ~pebs_mask),

Before the patch, the value of arr[0].guest would have been intel_ctrl &
~cpuc->intel_ctrl_host_mask & ~pebs_mask.  The intent is to always treat
PEBS events as host-only because, while the guest runs, there is no way
to tell the processor about the virtual address where to put PEBS records
intended for the host.

Unfortunately, the new expression can be expanded to

	(intel_ctrl & ~cpuc->intel_ctrl_host_mask) | (intel_ctrl & ~pebs_mask)

which makes no sense; it includes any bit that isn't *both* marked as
exclude_guest and using PEBS.  So, reinstate the old logic.  Another
way to write it could be "intel_ctrl & ~(cpuc->intel_ctrl_host_mask |
pebs_mask)", presumably the intention of the author of the faulty.
However, I personally find the repeated application of A AND NOT B to
be a bit more readable.

This shows up as guest failures when running concurrent long-running
perf workloads on the host, and was reported to happen with rcutorture.
All guests on a given host would die simultaneously with something like an
instruction fault or a segmentation violation.

Reported-by: Paul E. McKenney <[email protected]>
Analyzed-by: Sean Christopherson <[email protected]>
Tested-by: Paul E. McKenney <[email protected]>
Cc: [email protected]
Fixes: c59a1f1 ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS")
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 7e768ce upstream.

The kvm_pmu_refresh() may be called repeatedly (e.g. configure guest
CPUID repeatedly or update MSR_IA32_PERF_CAPABILITIES) and each
call will use the last pmu->all_valid_pmc_idx value, with the residual
bits introducing additional overhead later in the vPMU emulation.

Fixes: b35e554 ("KVM: x86/vPMU: Add lazy mechanism to release perf_event per vPMC")
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 3a6de51 upstream.

Now that KVM disallows changing feature MSRs, i.e. PERF_CAPABILITIES,
after running a vCPU, WARN and bug the VM if the PMU is refreshed after
the vCPU has run.

Note, KVM has disallowed CPUID updates after running a vCPU since commit
feb627e ("KVM: x86: Forbid KVM_SET_CPUID{,2} after KVM_RUN"), i.e.
PERF_CAPABILITIES was the only remaining way to trigger a PMU refresh
after KVM_RUN.

[Backport changes]
Upstream commit fb3146b adds kvm_vcpu_has_run(), but due to
conflicts, the patch is skipped. The API definition is added
for backport compatibility.

Cc: Like Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit f933b88 upstream.

Move the purging of common PMU metadata from intel_pmu_refresh() to
kvm_pmu_refresh(), and invoke the vendor refresh() hook if and only if
the VM is supposed to have a vPMU.

KVM already denies access to the PMU based on kvm->arch.enable_pmu, as
get_gp_pmc_amd() returns NULL for all PMCs in that case, i.e. KVM already
violates AMD's architecture by not virtualizing a PMU (kernels have long
since learned to not panic when the PMU is unavailable).  But configuring
the PMU as if it were enabled causes unwanted side effects, e.g. calls to
kvm_pmu_trigger_event() waste an absurd number of cycles due to the
all_valid_pmc_idx bitmap being non-zero.

Fixes: b1d66da ("KVM: x86/svm: Add module param to control PMU virtualization")
Reported-by: Konstantin Khorenko <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 05519c8 upstream.

Use a u64 instead of a u8 when taking a snapshot of pmu->fixed_ctr_ctrl
when reprogramming fixed counters, as truncating the value results in KVM
thinking fixed counter 2 is already disabled (the bug also affects fixed
counters 3+, but KVM doesn't yet support those).  As a result, if the
guest disables fixed counter 2, KVM will get a false negative and fail to
reprogram/disable emulation of the counter, which can leads to incorrect
counts and spurious PMIs in the guest.

Fixes: 76d287b ("KVM: x86/pmu: Drop "u8 ctrl, int idx" for reprogram_fixed_counter()")
Cc: [email protected]
Signed-off-by: Mingwei Zhang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[sean: rewrite changelog to call out the effects of the bug]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 73554b2 upstream.

When the irq_work callback, kvm_pmi_trigger_fn(), is invoked during a
VM-exit that also invokes __kvm_perf_overflow() as a result of
instruction emulation, kvm_pmu_deliver_pmi() will be called twice
before the next VM-entry.

Calling kvm_pmu_deliver_pmi() twice is unlikely to be problematic now that
KVM sets the LVTPC mask bit when delivering a PMI.  But using IRQ work to
trigger the PMI is still broken, albeit very theoretically.

E.g. if the self-IPI to trigger IRQ work is be delayed long enough for the
vCPU to be migrated to a different pCPU, then it's possible for
kvm_pmi_trigger_fn() to race with the kvm_pmu_deliver_pmi() from
KVM_REQ_PMI and still generate two PMIs.

KVM could set the mask bit using an atomic operation, but that'd just be
piling on unnecessary code to workaround what is effectively a hack.  The
*only* reason KVM uses IRQ work is to ensure the PMI is treated as a wake
event, e.g. if the vCPU just executed HLT.

Remove the irq_work callback for synthesizing a PMI, and all of the
logic for invoking it. Instead, to prevent a vcpu from leaving C0 with
a PMI pending, add a check for KVM_REQ_PMI to kvm_vcpu_has_events().

Fixes: 9cd803d ("KVM: x86: Update vPMCs when retiring instructions")
Signed-off-by: Jim Mattson <[email protected]>
Tested-by: Mingwei Zhang <[email protected]>
Tested-by: Dapeng Mi <[email protected]>
Signed-off-by: Mingwei Zhang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[sean: massage changelog]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 4736d85 upstream.

Commit ee3a5f9 ("KVM: x86: Do runtime CPUID update before updating
vcpu->arch.cpuid_entries") moved tweaking of the supplied CPUID
data earlier in kvm_set_cpuid() but __kvm_update_cpuid_runtime() actually
uses 'vcpu->arch.kvm_cpuid' (through __kvm_find_kvm_cpuid_features()) which
gets set later in kvm_set_cpuid(). In some cases, e.g. when kvm_set_cpuid()
is called for the first time and 'vcpu->arch.kvm_cpuid' is clear,
__kvm_find_kvm_cpuid_features() fails to find KVM PV feature entry and the
logic which clears KVM_FEATURE_PV_UNHALT after enabling
KVM_X86_DISABLE_EXITS_HLT does not work.

The logic, introduced by the commit ee3a5f9 ("KVM: x86: Do runtime
CPUID update before updating vcpu->arch.cpuid_entries") must stay: the
supplied CPUID data is tweaked by KVM first (__kvm_update_cpuid_runtime())
and checked later (kvm_check_cpuid()) and the actual data
(vcpu->arch.cpuid_*, vcpu->arch.kvm_cpuid, vcpu->arch.xen.cpuid,..) is only
updated on success.

Switch to searching for KVM_SIGNATURE in the supplied CPUID data to
discover KVM PV feature entry instead of using stale 'vcpu->arch.kvm_cpuid'.

While at it, drop the pointless "&& (best->eax & (1 << KVM_FEATURE_PV_UNHALT))"
check when clearing the KVM_FEATURE_PV_UNHALT bit.
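
Roughly, the resulting logic in __kvm_update_cpuid_runtime() becomes the
following (abridged sketch, not the verbatim diff; __kvm_get_hypervisor_cpuid()
and struct kvm_hypervisor_cpuid are introduced by other patches in this series):

  struct kvm_hypervisor_cpuid kvm_cpuid;
  struct kvm_cpuid_entry2 *best;

  /* Find the KVM PV leaves in the *supplied* entries instead of trusting
   * the not-yet-updated vcpu->arch.kvm_cpuid. */
  kvm_cpuid = __kvm_get_hypervisor_cpuid(entries, nent, KVM_SIGNATURE);
  if (kvm_cpuid.base) {
          best = __kvm_find_kvm_cpuid_features(entries, nent, kvm_cpuid.base);
          if (kvm_hlt_in_guest(vcpu->kvm) && best)
                  best->eax &= ~(1 << KVM_FEATURE_PV_UNHALT);
  }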

Fixes: ee3a5f9 ("KVM: x86: Do runtime CPUID update before updating vcpu->arch.cpuid_entries")
Reported-and-tested-by: Li RongQing <[email protected]>
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit cf8e55f upstream.

The CPUID features PDCM, DS and DTES64 are required for the PEBS feature.
KVM exposes the PDCM, DS and DTES64 CPUID feature bits to the guest when
PEBS is supported by KVM on Ice Lake server platforms.
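
A short kernel-style sketch of the VMX side (abridged and illustrative;
kvm_cpu_cap_check_and_set() and vmx_pebs_supported() are existing helpers
from this series, and PDCM is advertised separately via the
PERF_CAPABILITIES path):

  if (vmx_pebs_supported()) {
          kvm_cpu_cap_check_and_set(X86_FEATURE_DS);
          kvm_cpu_cap_check_and_set(X86_FEATURE_DTES64);
  }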

Originally-by: Andi Kleen <[email protected]>
Co-developed-by: Kan Liang <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Co-developed-by: Luwei Kang <[email protected]>
Signed-off-by: Luwei Kang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 9e985cb upstream.

Drop support for virtualizing adaptive PEBS, as KVM's implementation is
architecturally broken without an obvious/easy path forward, and because
exposing adaptive PEBS can leak host LBRs to the guest, i.e. can leak
host kernel addresses to the guest.

Bug #1 is that KVM doesn't account for the upper 32 bits of
IA32_FIXED_CTR_CTRL when (re)programming fixed counters, e.g.
fixed_ctrl_field() drops the upper bits, reprogram_fixed_counters()
stores local variables as u8s and truncates the upper bits too, etc.

Bug #2 is that, because KVM _always_ sets precise_ip to a non-zero value
for PEBS events, perf will _always_ generate an adaptive record, even if
the guest requested a basic record.  Note, KVM will also enable adaptive
PEBS in individual *counter*, even if adaptive PEBS isn't exposed to the
guest, but this is benign as MSR_PEBS_DATA_CFG is guaranteed to be zero,
i.e. the guest will only ever see Basic records.

Bug #3 is in perf.  intel_pmu_disable_fixed() doesn't clear the upper
bits either, i.e. leaves ICL_FIXED_0_ADAPTIVE set, and
intel_pmu_enable_fixed() effectively doesn't clear ICL_FIXED_0_ADAPTIVE
either.  I.e. perf _always_ enables ADAPTIVE counters, regardless of what
KVM requests.

Bug #4 is that adaptive PEBS *might* effectively bypass event filters set
by the host, as "Updated Memory Access Info Group" records information
that might be disallowed by userspace via KVM_SET_PMU_EVENT_FILTER.

Bug #5 is that KVM doesn't ensure LBR MSRs hold guest values (or at least
zeros) when entering a vCPU with adaptive PEBS, which allows the guest
to read host LBRs, i.e. host RIPs/addresses, by enabling "LBR Entries"
records.

Disable adaptive PEBS support as an immediate fix due to the severity of
the LBR leak in particular, and because fixing all of the bugs will be
non-trivial, e.g. not suitable for backporting to stable kernels.

Note!  This will break live migration, but trying to make KVM play nice
with live migration would be quite complicated, wouldn't be guaranteed to
work (i.e. KVM might still kill/confuse the guest), and it's not clear
that there are any publicly available VMMs that support adaptive PEBS,
let alone live migrate VMs that support adaptive PEBS, e.g. QEMU doesn't
support PEBS in any capacity.

[Backport changes]

The upstream change was made in arch/x86/kvm/vmx/vmx.c. For backport
compatibility, it is instead applied to vmx_get_perf_capabilities() in
arch/x86/kvm/vmx/capabilities.h.
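
A sketch of what the backported hunk boils down to (abridged and
illustrative, not the exact code): mask the PEBS Baseline bit out of the
advertised perf capabilities so adaptive PEBS is never exposed, while basic
PEBS handling is untouched.

  static inline u64 vmx_get_perf_capabilities(void)
  {
          u64 perf_cap = PMU_CAP_FW_WRITES;

          /* ... LBR and PEBS capability probing elided ... */

          /* Never advertise adaptive PEBS (PEBS Baseline) to the guest. */
          perf_cap &= ~PERF_CAP_PEBS_BASELINE;

          return perf_cap;
  }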

Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/all/[email protected]
Fixes: c59a1f1 ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS")
Cc: [email protected]
Cc: Like Xu <[email protected]>
Cc: Mingwei Zhang <[email protected]>
Cc: Zhenyu Wang <[email protected]>
Cc: Zhang Xiong <[email protected]>
Cc: Lv Zhiyuan <[email protected]>
Cc: Dapeng Mi <[email protected]>
Cc: Jim Mattson <[email protected]>
Acked-by: Like Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 1c4dc57 upstream.

The braces around the KVM_CAP_XSAVE2 block also surround the
KVM_CAP_PMU_CAPABILITY block, likely the result of a merge issue. Simply
move the curly brace back to where it belongs.
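
For illustration, the shape of the bug looks roughly like the pseudo-excerpt
below (abridged, not the exact upstream code); it compiles because a case
label may legally sit inside a nested compound statement:

  case KVM_CAP_XSAVE2: {
          u64 guest_perm = xstate_get_guest_group_perm();

          r = xstate_required_size(supported_xcr0 & guest_perm, false);
          break;
  case KVM_CAP_PMU_CAPABILITY:    /* unintentionally inside the XSAVE2 braces */
          r = enable_pmu ? KVM_CAP_PMU_VALID_MASK : 0;
          break;
  }       /* the fix moves this closing brace up, after the first break */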

Fixes: ba7bb66 ("KVM: x86: Provide per VM capability for disabling PMU virtualization")

Reviewed-by: David Matlack <[email protected]>
Reviewed-by: Peter Xu <[email protected]>
Signed-off-by: Ben Gardon <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
…re limit

commit 48639df upstream.

A subsequent patch will need to acquire the CPUID leaf range for emulated
Xen so explicitly pass the signature of the hypervisor we're interested in
to the new function. Also introduce a new kvm_hypervisor_cpuid structure
so we can neatly store both the base and limit leaf indices.
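
Roughly, the new structure has the following shape (sketch based on the
description above):

  /* The hypervisor signature leaf index plus the limit reported in its EAX. */
  struct kvm_hypervisor_cpuid {
          u32 base;
          u32 limit;
  };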

Signed-off-by: Paul Durrant <[email protected]>
Reviewed-by: David Woodhouse <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 92e82cf upstream.

Similar to kvm_find_kvm_cpuid_features()/__kvm_find_kvm_cpuid_features(),
introduce a helper to search for the specific hypervisor signature in any
struct kvm_cpuid_entry2 array, not only in vcpu->arch.cpuid_entries.

No functional change intended.
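
An abridged kernel-style sketch of the helper's shape (illustrative, not the
verbatim upstream code; cpuid_entry2_find() and
for_each_possible_hypervisor_cpuid_base() are existing KVM helpers):

  static struct kvm_hypervisor_cpuid
  __kvm_get_hypervisor_cpuid(struct kvm_cpuid_entry2 *entries, int nent,
                             const char *sig)
  {
          struct kvm_hypervisor_cpuid cpuid = {};
          struct kvm_cpuid_entry2 *entry;
          u32 base;

          for_each_possible_hypervisor_cpuid_base(base) {
                  entry = cpuid_entry2_find(entries, nent, base,
                                            KVM_CPUID_INDEX_NOT_SIGNIFICANT);
                  /* ebx/ecx/edx hold the 12-byte hypervisor signature. */
                  if (entry && !memcmp(&entry->ebx, sig, 12)) {
                          cpuid.base  = base;
                          cpuid.limit = entry->eax;
                          break;
                  }
          }

          return cpuid;
  }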

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 59cc99f upstream.

For the same purpose, the legacy intel_pmu_lbr_is_compatible() can be
renamed for reuse by more callers, and the comment about the LBR use case
can be deleted along the way.

Signed-off-by: Like Xu <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>