Skip to content

Conversation

@PvsNarasimha
Copy link

@PvsNarasimha PvsNarasimha commented May 9, 2025

Hardware Feature Enablement / Support

These patches add support for new CPU features, CPUID leaves, and hardware capabilities on Intel and AMD platforms.

  • x86/cpu: Enable STIBP on AMD if Automatic IBRS is enabled
  • KVM: x86/cpuid: Add AMD CPUID ExtPerfMonAndDbg leaf 0x80000022
  • KVM: x86/svm/pmu: Add AMD PerfMonV2 support
  • x86/cpu: Support AMD Automatic IBRS
  • x86/cpu, kvm: Add the Null Selector Clears Base feature
  • KVM: x86: add support for CPUID leaf 0x80000021
  • KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS
  • perf/x86/core: Completely disable guest PEBS via guest's global_ctrl
  • KVM: x86/pmu: Disable guest PEBS temporarily in two rare situations
  • KVM: x86/pmu: Reprogram PEBS event to emulate guest PEBS counter
  • KVM: SVM: include CR3 in initial VMSA state for SEV-ES guests

Performance Improvements

Optimizations to reduce overhead and improve runtime efficiency:

  • KVM: x86/pmu: Rewrite reprogram_counters() to improve performance
  • KVM: x86: Use static calls to reduce kvm_pmu_ops overhead
  • KVM: x86: Copy kvm_pmu_ops by value to eliminate layer of indirection
  • KVM: x86/pmu: Use binary search to check filtered events
  • KVM: x86/pmu: Avoid using PEBS perf_events for normal counters

Refactoring / Code Reorganization

These changes focus on improving maintainability, readability, and structure of the KVM/x86 codebase:

  • KVM: x86/cpuid: Refactor host/guest CPU model consistency check
  • KVM: x86: Introduce __kvm_get_hypervisor_cpuid() helper
  • KVM: VMX: Refactor intel_pmu_{g,}set_msr() to align with other helpers
  • KVM: x86: Move open-coded CPUID leaf 0x80000021 EAX bit propagation code
  • KVM: nVMX: Refactor PMU refresh to avoid referencing kvm_x86_ops.pmu_ops
  • KVM: x86: Move guts of kvm_arch_init() to standalone helper
  • KVM: x86/pmu: Move handling PERF_GLOBAL_CTRL and friends to common x86
  • Move various helpers (e.g., pmc_perf_hw_id())
  • KVM: x86: Use more verbose names for mem encrypt kvm_x86_ops hooks

Feature Enhancements (vPMU / CPUID / MSRs / etc.)

These provide new capabilities, configurable options, and fine-tuned control to user space or the guest:

  • KVM: x86: Provide per VM capability for disabling PMU virtualization
  • KVM: x86/svm: Add module param to control PMU virtualization
  • KVM: x86/pmu: Restrict advanced features based on module enable_pmu
  • KVM: x86/pmu: Advertise PERFCTR_CORE iff the min nr of counters is met
  • KVM: x86/pmu: Expose CPUIDs feature bits PDCM, DS, DTES64
  • KVM: x86: Use actual kvm_cpuid.base for clearing KVM_FEATURE_PV_UNHALT
  • KVM: x86: Snapshot if a vCPU's vendor model is AMD vs. Intel compatible
  • KVM: x86: Move lookup of indexed CPUID leafs to helper

User/Guest Safety / Fault Isolation

These prevent invalid or insecure guest/host interactions:

  • KVM: x86/pmu: Zero out PMU metadata on AMD if PMU is disabled
  • KVM: x86/pmu: Reject userspace attempts to set reserved GLOBAL_STATUS bits
  • KVM: x86/pmu: Prevent the PMU from counting disallowed events
  • KVM: x86/pmu: Limit the maximum number of supported AMD GP counters
  • KVM: x86/pmu: Limit the maximum number of supported Intel GP counters
  • KVM: x86/pmu: WARN and bug the VM if PMU is refreshed after vCPU has run

Bug Fixes

These patches address correctness issues, warnings, and reliability concerns in KVM and vPMU:

  • KVM: x86: Fix errant brace in KVM capability handling
  • KVM: x86/pmu: Fix type length error when reading pmu->fixed_ctr_ctrl
  • KVM: x86/pmu: fix masking logic for MSR_CORE_PERF_GLOBAL_CTRL
  • KVM: x86/pmu: fix masking logic for MSR_CORE_PERF_GLOBAL_CTRL
  • KVM: x86: Fix pointer mistmatch warning when patching RET0 static calls
  • KVM: x86: Fix clang -Wimplicit-fallthrough in do_host_cpuid()
  • KVM: x86: avoid out of bounds indices for fixed performance counters
  • KVM: x86/pmu: Do not mask LVTPC when handling a PMI on AMD platforms
  • kvm: x86/pmu: Fix the compare function used by the pmu event filter

Documentation or Cleanup / Misc

Cleanup and changes that make the codebase more developer-friendly:

  • docs: kvm: x86: Fix broken field list
  • KVM: x86/pmu: Rename global_ovf_ctrl_mask to global_status_mask
  • KVM: x86/pmu: Rename pmc_is_enabled() to pmc_is_globally_enabled()
  • KVM: x86: Rename kvm_x86_ops pointers to align w/ preferred vendor names
  • KVM: x86: Remove defunct pre_block/post_block kvm_x86_ops hooks
  • KVM: x86: Move CPUID.(EAX=0x12,ECX=1) mangling to __kvm_update_cpuid_runtime()
  • KVM: x86: use static_call_cond for optional callbacks

Run Test cases

$ git clone https://gitlab.com/kvm-unit-tests/kvm-unit-tests.git
$ cd kvm-unit-tests/
$ ./configure
$ make
$ ./run_tests.sh
root@volcano9b5e-os:/home/amd/Linux_Backport/kvm-unit-tests# ./run_tests.sh
PASS apic-split (56 tests)
PASS ioapic-split (19 tests)
PASS x2apic (56 tests)
FAIL xapic (timeout; duration=60)
PASS ioapic (26 tests)
SKIP cmpxchg8b (i386 only)
PASS smptest (1 tests)
PASS smptest3 (1 tests)
PASS vmexit_cpuid
PASS vmexit_vmcall
PASS vmexit_mov_from_cr8
PASS vmexit_mov_to_cr8
PASS vmexit_inl_pmtimer
PASS vmexit_ipi
PASS vmexit_ipi_halt
PASS vmexit_ple_round_robin
PASS vmexit_tscdeadline
PASS vmexit_tscdeadline_immed
PASS vmexit_cr0_wp
PASS vmexit_cr4_pge
PASS access (2 tests)
SKIP access_fep (test marked as manual run only)
SKIP access-reduced-maxphyaddr (/sys/module/kvm_intel/parameters/allow_smaller_maxphyaddr not equal to Y)
PASS smap (18 tests)
PASS pku (7 tests)
SKIP pks (0 tests)
PASS asyncpf (2 tests, 1 skipped)
PASS emulator (140 tests, 2 skipped)
PASS eventinj (13 tests)
PASS hypercall (2 tests)
PASS idt_test (4 tests)
PASS memory (7 tests, 1 skipped)
PASS msr (1836 tests)
SKIP pmu (/proc/sys/kernel/nmi_watchdog not equal to 0)
SKIP pmu_lbr (/proc/sys/kernel/nmi_watchdog not equal to 0)
SKIP pmu_pebs (/proc/sys/kernel/nmi_watchdog not equal to 0)
SKIP vmware_backdoors (/sys/module/kvm/parameters/enable_vmware_backdoor not equal to Y)
PASS realmode
PASS s3
PASS setjmp (10 tests)
PASS sieve
PASS syscall (2 tests)
PASS tsc (6 tests)
PASS tsc_adjust (6 tests)
PASS xsave (17 tests)
PASS rmap_chain
FAIL svm
SKIP svm_pause_filter (1 tests, 1 skipped)
PASS svm_npt (103 tests)
SKIP taskswitch (i386 only)
SKIP taskswitch2 (i386 only)
PASS kvmclock_test
PASS pcid-enabled (2 tests)
PASS pcid-disabled (2 tests)
PASS pcid-asymmetric (2 tests)
PASS rdpru (1 tests)
PASS umip (21 tests)
SKIP la57 (i386 only)
SKIP vmx (0 tests)
SKIP ept (0 tests)
SKIP vmx_eoi_bitmap_ioapic_scan (0 tests)
SKIP vmx_hlt_with_rvi_test (0 tests)
SKIP vmx_apicv_test (0 tests)
SKIP vmx_posted_intr_test (0 tests)
SKIP vmx_apic_passthrough_thread (0 tests)
SKIP vmx_init_signal_test (0 tests)
SKIP vmx_sipi_signal_test (0 tests)
SKIP vmx_apic_passthrough_tpr_threshold_test (0 tests)
SKIP vmx_vmcs_shadow_test (0 tests)
SKIP vmx_pf_exception_test (0 tests)
SKIP vmx_pf_exception_test_fep (test marked as manual run only)
SKIP vmx_pf_vpid_test (test marked as manual run only)
SKIP vmx_pf_invvpid_test (test marked as manual run only)
SKIP vmx_pf_no_vpid_test (test marked as manual run only)
SKIP vmx_pf_exception_test_reduced_maxphyaddr (/sys/module/kvm_intel/parameters/allow_smaller_maxphyaddr not equal to Y)
PASS debug (23 tests)
PASS hyperv_synic (1 tests)
PASS hyperv_connections (7 tests)
PASS hyperv_stimer (12 tests)
PASS hyperv_stimer_direct (8 tests)
PASS hyperv_clock (3 tests)
PASS intel_iommu (11 tests)
SKIP tsx-ctrl (1 tests, 1 skipped)
SKIP intel_cet (0 tests)
  • Verified via debug logs
root@volcano9dee-host:/home/amd# dmesg | grep Malathi 
[  29.605656] amd_pmu_vs_enable_all is called : Malathi 
[  29.607491] amd_pmu_ _v2_disable_all is called : Malathi
[  29.618834] amd_pmu_vs_disable_all is called : Malathi 
[  29.619491] amd_pmu_v2_enable_all is called : Malathi 
[  29.619530] amd_pmu_vs_disable_all is called : Malathi 
[  29.620491] amd_pmu_v2_enable_all is called : Malathi 
[  29.620493] amd_pmu_vs_disable_all is called : Malathi 
[  29.621491] amd_pmu_v2_enable_all is called : Malathi 
[  29.621491] amd_pmu_vs_disable_all is called : Malathi 
[  29.621491] amd_pmu_v2_enable_all is called : Malathi 
[  29.624441] amd_pmu_vs_disable_all is called : Malathi
  • Validated PMU with perf
root@volcano9dee-host:/home/amd/upstream_work/original_kernel_velinux/kernel/tools/perf# ./perf stat -e cycles,instructions a.out sleep 3
Hello,World
 
Performance counter stats for 'a.out sleep 3':
 
          5,74,208      cycles
          6,08,549      instructions              #    1.06  insn per cycle
 
       0.008310204 seconds time elapsed
 
       0.000000000 seconds user
       0.000797000 seconds sys
root@volcano9dee-host:/home/amd/upstream_work/original_kernel_velinux/kernel/tools/perf# ./perf record
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.273 MB perf.data (1560 samples) ]
./perf report | grep pmu
     0.11%  swapper          [kernel.kallsyms]         [k] x86_pmu_disable
     0.02%  perf             [kernel.kallsyms]         [k] amd_pmu_v2_enable_all
  • Booted VM with QEMU
vi vmc_script.sh //Copy the below script

#!/bin/bash

qemu-system-x86_64 \
 -m 2048 \
 -enable-kvm \
 -cpu host \
 -smp 2 \
 -hda /vms/images/velinux_chaithu.qcow2 \
 -cdrom /vms/iso/velinux-2.1-amd64-DVD-1.iso \
 -boot d \
 -netdev user,id=net0,hostfwd=tcp::2222-:22 \
 -device virtio-net-pci,netdev=net0 \
 -vnc :0

Run the script to create a VM

chmod +x vmc_script.sh
./vmc_script.sh

sandip4n and others added 30 commits May 9, 2025 15:32
commit 49ff3b4 upstream.

On AMD and Hygon platforms, the local APIC does not automatically set
the mask bit of the LVTPC register when handling a PMI and there is
no need to clear it in the kernel's PMI handler.

For guests, the mask bit is currently set by kvm_apic_local_deliver()
and unless it is cleared by the guest kernel's PMI handler, PMIs stop
arriving and break use-cases like sampling with perf record.

This does not affect non-PerfMonV2 guests because PMIs are handled in
the guest kernel by x86_pmu_handle_irq() which always clears the LVTPC
mask bit irrespective of the vendor.

Before:

  $ perf record -e cycles:u true
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.001 MB perf.data (1 samples) ]

After:

  $ perf record -e cycles:u true
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.002 MB perf.data (19 samples) ]

Fixes: a16eb25 ("KVM: x86: Mask LVTPC when handling a PMI")
Cc: [email protected]
Signed-off-by: Sandipan Das <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
[sean: use is_intel_compatible instead of !is_amd_or_hygon()]
Signed-off-by: Sean Christopherson <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Arukonda Rahul <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit fd706c9 upstream.

Add kvm_vcpu_arch.is_amd_compatible to cache if a vCPU's vendor model is
compatible with AMD, i.e. if the vCPU vendor is AMD or Hygon, along with
helpers to check if a vCPU is compatible AMD vs. Intel.  To handle Intel
vs. AMD behavior related to masking the LVTPC entry, KVM will need to
check for vendor compatibility on every PMI injection, i.e. querying for
AMD will soon be a moderately hot path.

Note!  This subtly (or maybe not-so-subtly) makes "Intel compatible" KVM's
default behavior, both if userspace omits (or never sets) CPUID 0x0 and if
userspace sets a completely unknown vendor.  One could argue that KVM
should treat such vCPUs as not being compatible with Intel *or* AMD, but
that would add useless complexity to KVM.

KVM needs to do *something* in the face of vendor specific behavior, and
so unless KVM conjured up a magic third option, choosing to treat unknown
vendors as neither Intel nor AMD means that checks on AMD compatibility
would yield Intel behavior, and checks for Intel compatibility would yield
AMD behavior.  And that's far worse as it would effectively yield random
behavior depending on whether KVM checked for AMD vs. Intel vs. !AMD vs.
!Intel.  And practically speaking, all x86 CPUs follow either Intel or AMD
architecture, i.e. "supporting" an unknown third architecture adds no
value.

Deliberately don't convert any of the existing guest_cpuid_is_intel()
checks, as the Intel side of things is messier due to some flows explicitly
checking for exactly vendor==Intel, versus some flows assuming anything
that isn't "AMD compatible" gets Intel behavior.  The Intel code will be
cleaned up in the future.

Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Arukonda Rahul <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 25b9784 upstream.

Manually look for a CPUID.0x1 entry instead of bouncing through
kvm_cpuid() when retrieving the Family-Model-Stepping information for
vCPU RESET/INIT.  This fixes a potential undefined behavior bug due to
kvm_cpuid() using the uninitialized "dummy" param as the ECX _input_,
a.k.a. the index.

A more minimal fix would be to simply zero "dummy", but the extra work in
kvm_cpuid() is wasteful, and KVM should be treating the FMS retrieval as
an out-of-band access, e.g. same as how KVM computes guest.MAXPHYADDR.
Both Intel's SDM and AMD's APM describe the RDX value at RESET/INIT as
holding the CPU's FMS information, not as holding CPUID.0x1.EAX.  KVM's
usage of CPUID entries to get FMS is simply a pragmatic approach to avoid
having yet another way for userspace to provide inconsistent data.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Arukonda Rahul <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 540c7ab upstream.

SDM section 18.2.3 mentioned that:

  "IA32_PERF_GLOBAL_OVF_CTL MSR allows software to clear overflow indicator(s) of
   any general-purpose or fixed-function counters via a single WRMSR."

It is R/W mentioned by SDM, we read this msr on bare-metal during perf testing,
the value is always 0 for ICX/SKX boxes on hands. Let's fill get_msr
MSR_CORE_PERF_GLOBAL_OVF_CTRL w/ 0 as hardware behavior and drop
global_ovf_ctrl variable.

Tested-by: Like Xu <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
…able

commit 73cd107 upstream.

Use the generic kvm_running_vcpu plus a new 'handling_intr_from_guest'
variable in kvm_arch_vcpu instead of the semi-redundant current_vcpu.
kvm_before/after_interrupt() must be called while the vCPU is loaded,
(which protects against preemption), thus kvm_running_vcpu is guaranteed
to be non-NULL when handling_intr_from_guest is non-zero.

Switching to kvm_get_running_vcpu() will allows moving KVM's perf
callbacks to generic code, and the new flag will be used in a future
patch to more precisely identify the "NMI from guest" case.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: PvsNarasimha <[email protected]>
commit e1bfc24 upstream.

Move x86's perf guest callbacks into common KVM, as they are semantically
identical to arm64's callbacks (the only other such KVM callbacks).
arm64 will convert to the common versions in a future patch.

Implement the necessary arm64 arch hooks now to avoid having to provide
stubs or a temporary #define (from x86) to avoid arm64 compilation errors
when CONFIG_GUEST_PERF_EVENTS=y.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: PvsNarasimha <[email protected]>
commit b1d66da upstream.

For Intel, the guest PMU can be disabled via clearing the PMU CPUID.
For AMD, all hw implementations support the base set of four
performance counters, with current mainstream hardware indicating
the presence of two additional counters via X86_FEATURE_PERFCTR_CORE.

In the virtualized world, the AMD guest driver may detect
the presence of at least one counter MSR. Most hypervisor
vendors would introduce a module param (like lbrv for svm)
to disable PMU for all guests.

Another control proposal per-VM is to pass PMU disable information
via MSR_IA32_PERF_CAPABILITIES or one bit in CPUID Fn4000_00[FF:00].
Both of methods require some guest-side changes, so a module
parameter may not be sufficiently granular, but practical enough.

Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 006a0f0 upstream.

Because IceLake has 4 fixed performance counters but KVM only
supports 3, it is possible for reprogram_fixed_counters to pass
to reprogram_fixed_counter an index that is out of bounds for the
fixed_pmc_events array.

Ultimately intel_find_fixed_event, which is the only place that uses
fixed_pmc_events, handles this correctly because it checks against the
size of fixed_pmc_events anyway.  Every other place operates on the
fixed_counters[] array which is sized according to INTEL_PMC_MAX_FIXED.
However, it is cleaner if the unsupported performance counters are culled
early on in reprogram_fixed_counters.

Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 7618756 upstream.

The current pmc->eventsel for fixed counter is underutilised. The
pmc->eventsel can be setup for all known available fixed counters
since we have mapping between fixed pmc index and
the intel_arch_events array.

Either gp or fixed counter, it will simplify the later checks for
consistency between eventsel and perf_hw_id.

Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 6ed1298 upstream.

Since we set the same semantic event value for the fixed counter in
pmc->eventsel, returning the perf_hw_id for the fixed counter via
find_fixed_event() can be painlessly replaced by pmc_perf_hw_id()
with the help of pmc_is_fixed() check.

Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 40ccb96 upstream.

Depending on whether intr should be triggered or not, KVM registers
two different event overflow callbacks in the perf_event context.

The code skeleton of these two functions is very similar, so
the pmc->intr can be stored into pmc from pmc_reprogram_counter()
which provides smaller instructions footprint against the
u-architecture branch predictor.

The __kvm_perf_overflow() can be called in non-nmi contexts
and a flag is needed to distinguish the caller context and thus
avoid a check on kvm_is_in_guest(), otherwise we might get
warnings from suspicious RCU or check_preemption_disabled().

[Backport Changes]
- In commit b9f5621, kvm_is_in_guest() was changed
  to kvm_guest_state().
- In commit 73cd107, kvm_guest_state() was updated
  to kvm_handling_nmi_from_guest().
- In commit 40ccb96, kvm_is_in_guest() was removed,
  but instead of removing kvm_handling_nmi_from_guest(pmc->vcpu)
  was retained for compatibility
- This backported patch adds kvm_handling_nmi_from_guest(pmc->vcpu)
  instead of kvm_is_in_guest() for compatibility.

Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 9cd803d upstream.

When KVM retires a guest instruction through emulation, increment any
vPMCs that are configured to monitor "instructions retired," and
update the sample period of those counters so that they will overflow
at the right time.

Signed-off-by: Eric Hankland <[email protected]>
[jmattson:
  - Split the code to increment "branch instructions retired" into a
    separate commit.
  - Added 'static' to kvm_pmu_incr_counter() definition.
  - Modified kvm_pmu_incr_counter() to check pmc->perf_event->state ==
    PERF_EVENT_STATE_ACTIVE.
]
Fixes: f5132b0 ("KVM: Expose a version 2 architectural PMU to a guests")
Signed-off-by: Jim Mattson <[email protected]>
[likexu:
  - Drop checks for pmc->perf_event or event state or event type
  - Increase a counter once its umask bits and the first 8 select bits are matched
  - Rewrite kvm_pmu_incr_counter() with a less invasive approach to the host perf;
  - Rename kvm_pmu_record_event to kvm_pmu_trigger_event;
  - Add counter enable and CPL check for kvm_pmu_trigger_event();
]
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 405329f upstream.

Normally guests will set up CR3 themselves, but some guests, such as
kselftests, and potentially CONFIG_PVH guests, rely on being booted
with paging enabled and CR3 initialized to a pre-allocated page table.

Currently CR3 updates via KVM_SET_SREGS* are not loaded into the guest
VMCB until just prior to entering the guest. For SEV-ES/SEV-SNP, this
is too late, since it will have switched over to using the VMSA page
prior to that point, with the VMSA CR3 copied from the VMCB initial
CR3 value: 0.

Address this by sync'ing the CR3 value into the VMCB save area
immediately when KVM_SET_SREGS* is issued so it will find it's way into
the initial VMSA.

Suggested-by: Tom Lendacky <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
Message-Id: <[email protected]>
[Remove vmx_post_set_cr3; add a remark about kvm_set_cr3 not calling the
 new hook. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
…tries

commit ee3a5f9 upstream.

kvm_update_cpuid_runtime() mangles CPUID data coming from userspace
VMM after updating 'vcpu->arch.cpuid_entries', this makes it
impossible to compare an update with what was previously
supplied. Introduce __kvm_update_cpuid_runtime() version which can be
used to tweak the input before it goes to 'vcpu->arch.cpuid_entries'
so the upcoming update check can compare tweaked data.

No functional change intended.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: chaithanyaLagisetty <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 4732f24 upstream.

The new module parameter to control PMU virtualization should apply
to Intel as well as AMD, for situations where userspace is not trusted.
If the module parameter allows PMU virtualization, there could be a
new KVM_CAP or guest CPUID bits whereby userspace can enable/disable
PMU virtualization on a per-VM basis.

If the module parameter does not allow PMU virtualization, there
should be no userspace override, since we have no precedent for
authorizing that kind of override. If it's false, other counter-based
profiling features (such as LBR including the associated CPUID bits
if any) will not be exposed.

Change its name from "pmu" to "enable_pmu" as we have temporary
variables with the same name in our code like "struct kvm_pmu *pmu".

Fixes: b1d66da ("KVM: x86/svm: Add module param to control PMU virtualization")
Suggested-by : Jim Mattson <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 7ff775a upstream.

The PMU event filter may contain up to 300 events. Replace the linear
search in reprogram_gp_counter() with a binary search.

Signed-off-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit c3e8abf upstream.

Drop kvm_x86_ops' pre/post_block() now that all implementations are nops.

No functional change intended.

[Backport Changes]
Definitions of pi_{pre, post}_block() were removed in the commit: d76fb40

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
…runtime()

commit 5c89be1 upstream.

Full equality check of CPUID data on update (kvm_cpuid_check_equal()) may
fail for SGX enabled CPUs as CPUID.(EAX=0x12,ECX=1) is currently being
mangled in kvm_vcpu_after_set_cpuid(). Move it to
__kvm_update_cpuid_runtime() and split off cpuid_get_supported_xcr0()
helper  as 'vcpu->arch.guest_supported_xcr0' update needs (logically)
to stay in kvm_vcpu_after_set_cpuid().

Cc: [email protected]
Fixes: feb627e ("KVM: x86: Forbid KVM_SET_CPUID{,2} after KVM_RUN")
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Arukonda Rahul <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 2746a6b upstream.

Hypervisor leaves are always synthesized by __do_cpuid_func; just return
zeroes and do not ask the host.  Even on nested virtualization, a value
from another hypervisor would be bogus, because all hypercalls and MSRs
are processed by KVM.

Reviewed-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Arukonda Rahul <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit feee3d9 upstream.

Remove the export of kvm_x86_tlb_flush_current() as there are no longer
any users outside of common x86 code.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit e27bc04 upstream.

Rename a variety of kvm_x86_op function pointers so that preferred name
for vendor implementations follows the pattern <vendor>_<function>, e.g.
rename .run() to .vcpu_run() to match {svm,vmx}_vcpu_run().  This will
allow vendor implementations to be wired up via the KVM_X86_OP macro.

In many cases, VMX and SVM "disagree" on the preferred name, though in
reality it's VMX and x86 that disagree as SVM blindly prepended _svm to
the kvm_x86_ops name.  Justification for using the VMX nomenclature:

  - set_{irq,nmi} => inject_{irq,nmi} because the helper is injecting an
    event that has already been "set" in e.g. the vIRR.  SVM's relevant
    VMCB field is even named event_inj, and KVM's stat is irq_injections.

  - prepare_guest_switch => prepare_switch_to_guest because the former is
    ambiguous, e.g. it could mean switching between multiple guests,
    switching from the guest to host, etc...

  - update_pi_irte => pi_update_irte to allow for matching match the rest
    of VMX's posted interrupt naming scheme, which is vmx_pi_<blah>().

  - start_assignment => pi_start_assignment to again follow VMX's posted
    interrupt naming scheme, and to provide context for what bit of code
    might care about an otherwise undescribed "assignment".

The "tlb_flush" => "flush_tlb" creates an inconsistency with respect to
Hyper-V's "tlb_remote_flush" hooks, but Hyper-V really is the one that's
wrong.  x86, VMX, and SVM all use flush_tlb, and even common KVM is on a
variant of the bandwagon with "kvm_flush_remote_tlbs", e.g. a more
appropriate name for the Hyper-V hooks would be flush_remote_tlbs.  Leave
that change for another time as the Hyper-V hooks always start as NULL,
i.e. the name doesn't matter for using kvm-x86-ops.h, and changing all
names requires an astounding amount of churn.

VMX and SVM function names are intentionally left as is to minimize the
diff.  Both VMX and SVM will need to rename even more functions in order
to fully utilize KVM_X86_OPS, i.e. an additional patch for each is
inevitable.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 0bcd556 upstream.

Refactor the nested VMX PMU refresh helper to pass it a flag stating
whether or not the vCPU has PERF_GLOBAL_CTRL instead of having the nVMX
helper query the information by bouncing through kvm_x86_ops.pmu_ops.
This will allow a future patch to use static_call() for the PMU ops
without having to export any static call definitions from common x86, and
it is also a step toward unexported kvm_x86_ops.

Alternatively, nVMX could call kvm_pmu_is_valid_msr() to indirectly use
kvm_x86_ops.pmu_ops, but that would incur an extra layer of indirection
and would require exporting kvm_pmu_is_valid_msr().

Opportunistically rename the helper to keep line lengths somewhat
reasonable, and to better capture its high-level role.

No functional change intended.

Cc: Like Xu <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 03d004c upstream.

Use slightly more verbose names for the so called "memory encrypt",
a.k.a. "mem enc", kvm_x86_ops hooks to bridge the gap between the current
super short kvm_x86_ops names and SVM's more verbose, but non-conforming
names.  This is a step toward using kvm-x86-ops.h with KVM_X86_CVM_OP()
to fill svm_x86_ops.

Opportunistically rename mem_enc_op() to mem_enc_ioctl() to better
reflect its true nature, as it really is a full fledged ioctl() of its
own.  Ideally, the hook would be named confidential_vm_ioctl() or so, as
the ioctl() is a gateway to more than just memory encryption, and because
its underlying purpose to support Confidential VMs, which can be provided
without memory encryption, e.g. if the TCB of the guest includes the host
kernel but not host userspace, or by isolation in hardware without
encrypting memory.  But, diverging from KVM_MEMORY_ENCRYPT_OP even
further is undeseriable, and short of creating alises for all related
ioctl()s, which introduces a different flavor of divergence, KVM is stuck
with the nomenclature.

Defer renaming SVM's functions to a future commit as there are additional
changes needed to make SVM fully conforming and to match reality (looking
at you, svm_vm_copy_asid_from()).

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 8a28978 upstream.

The two ioctls used to implement userspace-accelerated TPR,
KVM_TPR_ACCESS_REPORTING and KVM_SET_VAPIC_ADDR, are available
even if hardware-accelerated TPR can be used.  So there is
no reason not to report KVM_CAP_VAPIC.

[Backport changes]
- In commit 58fccda, report_flexpriority() is renamed
to vmx_cpu_has_accelerated_tpr().

Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 2a89061 upstream.

SVM implements neither update_emulated_instruction nor
set_apic_access_page_addr.  Remove an "if" by calling them
with static_call_cond().

Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit e4fc23b upstream.

The original use of KVM_X86_OP_NULL, which was to mark calls
that do not follow a specific naming convention, is not in use
anymore.  Instead, let's mark calls that are optional because
they are always invoked within conditionals or with static_call_cond.
Those that are _not_, i.e. those that are defined with KVM_X86_OP,
must be defined by both vendor modules or some kind of NULL pointer
dereference is bound to happen at runtime.

[Backport Changes]

Replace KVM_X86_OP_NULL with KVM_X86_OP_OPTIONAL for guest_memory_reclaimed() API
ensuring better alignment with upstream. changes.

Notably, APIs such as vm_copy_enc_context_from() and
vm_move_enc_context_from() are not part of our kernel, so they are excluded
from this change.

The backport commit f349144 uses KVM_X86_OP_NULL in the
vcpu_precreate() function, whereas the upstream. commit
d588bb9 has updated vcpu_precreate() to use KVM_X86_OP_OPTIONAL_RET0
instead, which is consistent with this change.

This update ensures consistency with the upstream. implementation and eliminates
legacy null operations.

Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit dd2319c upstream.

Use the newly corrected KVM_X86_OP annotations to warn about possible
NULL pointer dereferences as soon as the vendor module is loaded.

Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 5be2226 upstream.

A few vendor callbacks are only used by VMX, but they return an integer
or bool value.  Introduce KVM_X86_OP_OPTIONAL_RET0 for them: if a func is
NULL in struct kvm_x86_ops, it will be changed to __static_call_return0
when updating static calls.

[Backport changes]
In this commit f0f101b in file of "kernel/static_call.c"
added the EXPORT_SYMBOL_GPL(__static_call_return0);

Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 9250887 upstream.

Cast kvm_x86_ops.func to 'void *' when updating KVM static calls that are
conditionally patched to __static_call_return0().  clang complains about
using mismatching pointers in the ternary operator, which breaks the
build when compiling with CONFIG_KVM_WERROR=y.

  >> arch/x86/include/asm/kvm-x86-ops.h:82:1: warning: pointer type mismatch
  ('bool (*)(struct kvm_vcpu *)' and 'void *') [-Wpointer-type-mismatch]

Fixes: 5be2226 ("KVM: x86: allow defining return-0 static calls")
Reported-by: Like Xu <[email protected]>
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: David Dunn <[email protected]>
Reviewed-by: Nathan Chancellor <[email protected]>
Tested-by: Nathan Chancellor <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 58b3d12 upstream.

CPUID leaf 0x80000021 defines some features (or lack of bugs) of AMD
processors.  Expose the ones that make sense via KVM_GET_SUPPORTED_CPUID.

Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Arukonda Rahul <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
Like Xu and others added 28 commits May 9, 2025 15:32
commit 8de1854 upstream.

Move reprogram_counters() out of Intel specific PMU code and into pmu.h so
that it can be used to implement AMD PMU v2 support.

No functional change intended.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Like Xu <[email protected]>
[sean: rewrite changelog]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
… bits

commit 30dab5c upstream.

Reject userspace writes to MSR_CORE_PERF_GLOBAL_STATUS that attempt to set
reserved bits.  Allowing userspace to stuff reserved bits doesn't harm KVM
itself, but it's architecturally wrong and the guest can't clear the
unsupported bits, e.g. makes the guest's PMI handler very confused.

Signed-off-by: Like Xu <[email protected]>
[sean: rewrite changelog to avoid use of #GP, rebase on name change]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: chaithanyaLagisetty <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit c85cdc1 upstream.

Move the handling of GLOBAL_CTRL, GLOBAL_STATUS, and GLOBAL_OVF_CTRL,
a.k.a. GLOBAL_STATUS_RESET, from Intel PMU code to generic x86 PMU code.
AMD PerfMonV2 defines three registers that have the same semantics as
Intel's variants, just with different names and indices.  Conveniently,
since KVM virtualizes GLOBAL_CTRL on Intel only for PMU v2 and above, and
AMD's version shows up in v2, KVM can use common code for the existence
check as well.

[Backport changes]
This change removes the condition that returns the value of pmu->version > 1
from the file `arch/x86/kvm/vmx/pmu_intel.c`, which was included
in upstream commit b663f0b.

Signed-off-by: Like Xu <[email protected]>
Co-developed-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 13afa29 upstream.

Move the Intel PMU implementation of pmc_is_enabled() to common x86 code
as pmc_is_globally_enabled(), and drop AMD's implementation.  AMD PMU
currently supports only v1, and thus not PERF_GLOBAL_CONTROL, thus the
semantics for AMD are unchanged.  And when support for AMD PMU v2 comes
along, the common behavior will also Just Work.

Signed-off-by: Like Xu <[email protected]>
Co-developed-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: chaithanyaLagisetty <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 6593039 upstream.

Add an explicit !enable_pmu check as relying on kvm_pmu_cap to be
zeroed isn't obvious. Although when !enable_pmu, KVM will have
zero-padded kvm_pmu_cap to do subsequent CPUID leaf assignments.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: chaithanyaLagisetty <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 6a08083 upstream.

Disable PMU support when running on AMD and perf reports fewer than four
general purpose counters. All AMD PMUs must define at least four counters
due to AMD's legacy architecture hardcoding the number of counters
without providing a way to enumerate the number of counters to software,
e.g. from AMD's APM:

 The legacy architecture defines four performance counters (PerfCtrn)
 and corresponding event-select registers (PerfEvtSeln).

Virtualizing fewer than four counters can lead to guest instability as
software expects four counters to be available. Rather than bleed AMD
details into the common code, just define a const unsigned int and
provide a convenient location to document why Intel and AMD have different
mins (in particular, AMD's lack of any way to enumerate less than four
counters to the guest).

Keep the minimum number of counters at Intel at one, even though old P6
and Core Solo/Duo processor effectively require a minimum of two counters.
KVM can, and more importantly has up until this point, supported a vPMU so
long as the CPU has at least one counter.  Perf's support for P6/Core CPUs
does require two counters, but perf will happily chug along with a single
counter when running on a modern CPU.

[Backport changes]
Adjusted tab space to align with upstream. commit style.
No functional change was made to the code in this section.

Cc: Jim Mattson <[email protected]>
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Like Xu <[email protected]>
[sean: set Intel min to '1', not '2']
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: chaithanyaLagisetty <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit d338d87 upstream.

Enable and advertise PERFCTR_CORE if and only if the minimum number of
required counters are available, i.e. if perf says there are less than six
general purpose counters.

Opportunistically, use kvm_cpu_cap_check_and_set() instead of open coding
the check for host support.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Like Xu <[email protected]>
[sean: massage shortlog and changelog]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: chaithanyaLagisetty <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 1c2bf8a upstream.

Cap the number of general purpose counters enumerated on AMD to what KVM
actually supports, i.e. don't allow userspace to coerce KVM into thinking
there are more counters than actually exist, e.g. by enumerating
X86_FEATURE_PERFCTR_CORE in guest CPUID when its not supported.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Like Xu <[email protected]>
[sean: massage changelog]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: chaithanyaLagisetty <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit fe8d76c upstream.

Add a KVM-only leaf for AMD's PerfMonV2 to redirect the kernel's scattered
version to its architectural location, e.g. so that KVM can query guest
support via guest_cpuid_has().

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Like Xu <[email protected]>
[sean: massage changelog]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: chaithanyaLagisetty <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 4a27718 upstream.

If AMD Performance Monitoring Version 2 (PerfMonV2) is detected by
the guest, it can use a new scheme to manage the Core PMCs using the
new global control and status registers.

In addition to benefiting from the PerfMonV2 functionality in the same
way as the host (higher precision), the guest also can reduce the number
of vm-exits by lowering the total number of MSRs accesses.

In terms of implementation details, amd_is_valid_msr() is resurrected
since three newly added MSRs could not be mapped to one vPMC.
The possibility of emulating PerfMonV2 on the mainframe has also
been eliminated for reasons of precision.

Co-developed-by: Sandipan Das <[email protected]>
Signed-off-by: Sandipan Das <[email protected]>
Signed-off-by: Like Xu <[email protected]>
[sean: drop "Based on the observed HW." comments]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 94cdeeb upstream.

CPUID leaf 0x80000022 i.e. ExtPerfMonAndDbg advertises some new
performance monitoring features for AMD processors.

Bit 0 of EAX indicates support for Performance Monitoring Version 2
(PerfMonV2) features. If found to be set during PMU initialization,
the EBX bits of the same CPUID function can be used to determine
the number of available PMCs for different PMU types.

Expose the relevant bits via KVM_GET_SUPPORTED_CPUID so that
guests can make use of the PerfMonV2 features.

Co-developed-by: Sandipan Das <[email protected]>
Signed-off-by: Sandipan Das <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit fd470a8 upstream.

Unlike Intel's Enhanced IBRS feature, AMD's Automatic IBRS does not
provide protection to processes running at CPL3/user mode, see section
"Extended Feature Enable Register (EFER)" in the APM v2 at
https://bugzilla.kernel.org/attachment.cgi?id=304652

Explicitly enable STIBP to protect against cross-thread CPL3
branch target injections on systems with Automatic IBRS enabled.

Also update the relevant documentation.

Fixes: e7862ed ("x86/cpu: Support AMD Automatic IBRS")
Reported-by: Tom Lendacky <[email protected]>
Signed-off-by: Kim Phillips <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: PvsNarasimha <[email protected]>
commit 3f2739b upstream.

Temporarily acquire kvm->srcu for read when potentially emulating WRMSR in
the VM-Exit fastpath handler, as several of the common helpers used during
emulation expect the caller to provide SRCU protection.  E.g. if the guest
is counting instructions retired, KVM will query the PMU event filter when
stepping over the WRMSR.

  dump_stack+0x85/0xdf
  lockdep_rcu_suspicious+0x109/0x120
  pmc_event_is_allowed+0x165/0x170
  kvm_pmu_trigger_event+0xa5/0x190
  handle_fastpath_set_msr_irqoff+0xca/0x1e0
  svm_vcpu_run+0x5c3/0x7b0 [kvm_amd]
  vcpu_enter_guest+0x2108/0x2580

Alternatively, check_pmu_event_filter() could acquire kvm->srcu, but this
isn't the first bug of this nature, e.g. see commit 5c30e81 ("KVM:
SVM: Skip WRMSR fastpath on VM-Exit if next RIP isn't valid").  Providing
protection for the entirety of WRMSR emulation will allow reverting the
aforementioned commit, and will avoid having to play whack-a-mole when new
uses of SRCU-protected structures are inevitably added in common emulation
helpers.

[Backport changes]
Retain old srcu_read_lock/unlock() for compatibility due to upstream conflict

Upstream commit 2031f28 renames srcu_read_lock/unlock() to
kvm_vcpu_srcu_read_lock/unlock(). To avoid conflicts, the old implementation
is retained for compatibility until the issue is resolved.

Fixes: dfdeda6 ("KVM: x86/pmu: Prevent the PMU from counting disallowed events")
Reported-by: Greg Thelen <[email protected]>
Reported-by: Aaron Lewis <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit b29a2ac upstream.

Performance counters are defined to have width less than 64 bits.  The
vPMU code maintains the counters in u64 variables but assumes the value
to fit within the defined width.  However, for Intel non-full-width
counters (MSR_IA32_PERFCTRx) the value receieved from the guest is
truncated to 32 bits and then sign-extended to full 64 bits.  If a
negative value is set, it's sign-extended to 64 bits, but then in
kvm_pmu_incr_counter() it's incremented, truncated, and compared to the
previous value for overflow detection.

That previous value is not truncated, so it always evaluates bigger than
the truncated new one, and a PMI is injected.  If the PMI handler writes
a negative counter value itself, the vCPU never quits the PMI loop.

Turns out that Linux PMI handler actually does write the counter with
the value just read with RDPMC, so when no full-width support is exposed
via MSR_IA32_PERF_CAPABILITIES, and the guest initializes the counter to
a negative value, it locks up.

This has been observed in the field, for example, when the guest configures
atop to use perfevents and runs two instances of it simultaneously.

To address the problem, maintain the invariant that the counter value
always fits in the defined bit width, by truncating the received value
in the respective set_msr methods.  For better readability, factor the
out into a helper function, pmc_write_counter(), shared by vmx and svm
parts.

Fixes: 9cd803d ("KVM: x86: Update vPMCs when retiring instructions")
Cc: [email protected]
Signed-off-by: Roman Kagan <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
Tested-by: Like Xu <[email protected]>
[sean: tweak changelog, s/set/write in the helper]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
…mode

commit 547c919 upstream.

When querying whether or not a vCPU "is" running in kernel mode, directly
get the CPL if the vCPU is the currently loaded vCPU.  In scenarios where
a guest is profiled via perf-kvm, querying vcpu->arch.preempted_in_kernel
from kvm_guest_state() is wrong if vCPU is actively running, i.e. isn't
scheduled out due to being preempted and so preempted_in_kernel is stale.

This affects perf/core's ability to accurately tag guest RIP with
PERF_RECORD_MISC_GUEST_{KERNEL|USER} and record it in the sample.  This
causes perf/tool to fail to connect the vCPU RIPs to the guest kernel
space symbols when parsing these samples due to incorrect PERF_RECORD_MISC
flags:

   Before (perf-report of a cpu-cycles sample):
      1.23%  :58945   [unknown]         [u] 0xffffffff818012e0

   After:
      1.35%  :60703   [kernel.vmlinux]  [g] asm_exc_page_fault

Note, checking preempted_in_kernel in kvm_arch_vcpu_in_kernel() is awful
as nothing in the API's suggests that it's safe to use if and only if the
vCPU was preempted.  That can be cleaned up in the future, for now just
fix the glaring correctness bug.

Note openvelinux#2, checking vcpu->preempted is NOT safe, as getting the CPL on VMX
requires VMREAD, i.e. is correct if and only if the vCPU is loaded.  If
the target vCPU *was* preempted, then it can be scheduled back in after
the check on vcpu->preempted in kvm_vcpu_on_spin(), i.e. KVM could end up
trying to do VMREAD on a VMCS that isn't loaded on the current pCPU.

Signed-off-by: Like Xu <[email protected]>
Fixes: e1bfc24 ("KVM: Move x86's perf guest info callbacks to generic KVM")
Link: https://lore.kernel.org/r/[email protected]
[sean: massage changelong, add Fixes]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 9710794 upstream.

When commit c59a1f1 ("KVM: x86/pmu: Add IA32_PEBS_ENABLE
MSR emulation for extended PEBS") switched the initialization of
cpuc->guest_switch_msrs to use compound literals, it screwed up
the boolean logic:

+	u64 pebs_mask = cpuc->pebs_enabled & x86_pmu.pebs_capable;
...
-	arr[0].guest = intel_ctrl & ~cpuc->intel_ctrl_host_mask;
-	arr[0].guest &= ~(cpuc->pebs_enabled & x86_pmu.pebs_capable);
+               .guest = intel_ctrl & (~cpuc->intel_ctrl_host_mask | ~pebs_mask),

Before the patch, the value of arr[0].guest would have been intel_ctrl &
~cpuc->intel_ctrl_host_mask & ~pebs_mask.  The intent is to always treat
PEBS events as host-only because, while the guest runs, there is no way
to tell the processor about the virtual address where to put PEBS records
intended for the host.

Unfortunately, the new expression can be expanded to

	(intel_ctrl & ~cpuc->intel_ctrl_host_mask) | (intel_ctrl & ~pebs_mask)

which makes no sense; it includes any bit that isn't *both* marked as
exclude_guest and using PEBS.  So, reinstate the old logic.  Another
way to write it could be "intel_ctrl & ~(cpuc->intel_ctrl_host_mask |
pebs_mask)", presumably the intention of the author of the faulty.
However, I personally find the repeated application of A AND NOT B to
be a bit more readable.

This shows up as guest failures when running concurrent long-running
perf workloads on the host, and was reported to happen with rcutorture.
All guests on a given host would die simultaneously with something like an
instruction fault or a segmentation violation.

Reported-by: Paul E. McKenney <[email protected]>
Analyzed-by: Sean Christopherson <[email protected]>
Tested-by: Paul E. McKenney <[email protected]>
Cc: [email protected]
Fixes: c59a1f1 ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS")
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 7e768ce upstream.

The kvm_pmu_refresh() may be called repeatedly (e.g. configure guest
CPUID repeatedly or update MSR_IA32_PERF_CAPABILITIES) and each
call will use the last pmu->all_valid_pmc_idx value, with the residual
bits introducing additional overhead later in the vPMU emulation.

Fixes: b35e554 ("KVM: x86/vPMU: Add lazy mechanism to release perf_event per vPMC")
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 3a6de51 upstream.

Now that KVM disallows changing feature MSRs, i.e. PERF_CAPABILITIES,
after running a vCPU, WARN and bug the VM if the PMU is refreshed after
the vCPU has run.

Note, KVM has disallowed CPUID updates after running a vCPU since commit
feb627e ("KVM: x86: Forbid KVM_SET_CPUID{,2} after KVM_RUN"), i.e.
PERF_CAPABILITIES was the only remaining way to trigger a PMU refresh
after KVM_RUN.

[Backport changes]
Upstream commit fb3146b adds kvm_vcpu_has_run(), but due to
conflicts, the patch is skipped. The API definition is added
for backport compatibility.

Cc: Like Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit f933b88 upstream.

Move the purging of common PMU metadata from intel_pmu_refresh() to
kvm_pmu_refresh(), and invoke the vendor refresh() hook if and only if
the VM is supposed to have a vPMU.

KVM already denies access to the PMU based on kvm->arch.enable_pmu, as
get_gp_pmc_amd() returns NULL for all PMCs in that case, i.e. KVM already
violates AMD's architecture by not virtualizing a PMU (kernels have long
since learned to not panic when the PMU is unavailable).  But configuring
the PMU as if it were enabled causes unwanted side effects, e.g. calls to
kvm_pmu_trigger_event() waste an absurd number of cycles due to the
all_valid_pmc_idx bitmap being non-zero.

Fixes: b1d66da ("KVM: x86/svm: Add module param to control PMU virtualization")
Reported-by: Konstantin Khorenko <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 05519c8 upstream.

Use a u64 instead of a u8 when taking a snapshot of pmu->fixed_ctr_ctrl
when reprogramming fixed counters, as truncating the value results in KVM
thinking fixed counter 2 is already disabled (the bug also affects fixed
counters 3+, but KVM doesn't yet support those).  As a result, if the
guest disables fixed counter 2, KVM will get a false negative and fail to
reprogram/disable emulation of the counter, which can leads to incorrect
counts and spurious PMIs in the guest.

Fixes: 76d287b ("KVM: x86/pmu: Drop "u8 ctrl, int idx" for reprogram_fixed_counter()")
Cc: [email protected]
Signed-off-by: Mingwei Zhang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[sean: rewrite changelog to call out the effects of the bug]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 73554b2 upstream.

When the irq_work callback, kvm_pmi_trigger_fn(), is invoked during a
VM-exit that also invokes __kvm_perf_overflow() as a result of
instruction emulation, kvm_pmu_deliver_pmi() will be called twice
before the next VM-entry.

Calling kvm_pmu_deliver_pmi() twice is unlikely to be problematic now that
KVM sets the LVTPC mask bit when delivering a PMI.  But using IRQ work to
trigger the PMI is still broken, albeit very theoretically.

E.g. if the self-IPI to trigger IRQ work is be delayed long enough for the
vCPU to be migrated to a different pCPU, then it's possible for
kvm_pmi_trigger_fn() to race with the kvm_pmu_deliver_pmi() from
KVM_REQ_PMI and still generate two PMIs.

KVM could set the mask bit using an atomic operation, but that'd just be
piling on unnecessary code to workaround what is effectively a hack.  The
*only* reason KVM uses IRQ work is to ensure the PMI is treated as a wake
event, e.g. if the vCPU just executed HLT.

Remove the irq_work callback for synthesizing a PMI, and all of the
logic for invoking it. Instead, to prevent a vcpu from leaving C0 with
a PMI pending, add a check for KVM_REQ_PMI to kvm_vcpu_has_events().

Fixes: 9cd803d ("KVM: x86: Update vPMCs when retiring instructions")
Signed-off-by: Jim Mattson <[email protected]>
Tested-by: Mingwei Zhang <[email protected]>
Tested-by: Dapeng Mi <[email protected]>
Signed-off-by: Mingwei Zhang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[sean: massage changelog]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 4736d85 upstream.

Commit ee3a5f9 ("KVM: x86: Do runtime CPUID update before updating
vcpu->arch.cpuid_entries") moved tweaking of the supplied CPUID
data earlier in kvm_set_cpuid() but __kvm_update_cpuid_runtime() actually
uses 'vcpu->arch.kvm_cpuid' (through __kvm_find_kvm_cpuid_features()) which
gets set later in kvm_set_cpuid(). In some cases, e.g. when kvm_set_cpuid()
is called for the first time and 'vcpu->arch.kvm_cpuid' is clear,
__kvm_find_kvm_cpuid_features() fails to find KVM PV feature entry and the
logic which clears KVM_FEATURE_PV_UNHALT after enabling
KVM_X86_DISABLE_EXITS_HLT does not work.

The logic, introduced by the commit ee3a5f9 ("KVM: x86: Do runtime
CPUID update before updating vcpu->arch.cpuid_entries") must stay: the
supplied CPUID data is tweaked by KVM first (__kvm_update_cpuid_runtime())
and checked later (kvm_check_cpuid()) and the actual data
(vcpu->arch.cpuid_*, vcpu->arch.kvm_cpuid, vcpu->arch.xen.cpuid,..) is only
updated on success.

Switch to searching for KVM_SIGNATURE in the supplied CPUID data to
discover KVM PV feature entry instead of using stale 'vcpu->arch.kvm_cpuid'.

While at it, drop the pointless "&& (best->eax & (1 << KVM_FEATURE_PV_UNHALT))"
check when clearing the KVM_FEATURE_PV_UNHALT bit.
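
Roughly, the resulting logic in __kvm_update_cpuid_runtime() becomes the
following (abridged sketch, not the verbatim diff; __kvm_get_hypervisor_cpuid()
and struct kvm_hypervisor_cpuid are introduced by other patches in this series):

  struct kvm_hypervisor_cpuid kvm_cpuid;
  struct kvm_cpuid_entry2 *best;

  /* Find the KVM PV leaves in the *supplied* entries instead of trusting
   * the not-yet-updated vcpu->arch.kvm_cpuid. */
  kvm_cpuid = __kvm_get_hypervisor_cpuid(entries, nent, KVM_SIGNATURE);
  if (kvm_cpuid.base) {
          best = __kvm_find_kvm_cpuid_features(entries, nent, kvm_cpuid.base);
          if (kvm_hlt_in_guest(vcpu->kvm) && best)
                  best->eax &= ~(1 << KVM_FEATURE_PV_UNHALT);
  }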

Fixes: ee3a5f9 ("KVM: x86: Do runtime CPUID update before updating vcpu->arch.cpuid_entries")
Reported-and-tested-by: Li RongQing <[email protected]>
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit cf8e55f upstream.

The CPUID features PDCM, DS and DTES64 are required for the PEBS feature.
KVM exposes the PDCM, DS and DTES64 CPUID feature bits to the guest when
PEBS is supported by KVM on Ice Lake server platforms.
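
A short kernel-style sketch of the VMX side (abridged and illustrative;
kvm_cpu_cap_check_and_set() and vmx_pebs_supported() are existing helpers
from this series, and PDCM is advertised separately via the
PERF_CAPABILITIES path):

  if (vmx_pebs_supported()) {
          kvm_cpu_cap_check_and_set(X86_FEATURE_DS);
          kvm_cpu_cap_check_and_set(X86_FEATURE_DTES64);
  }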

Originally-by: Andi Kleen <[email protected]>
Co-developed-by: Kan Liang <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Co-developed-by: Luwei Kang <[email protected]>
Signed-off-by: Luwei Kang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 9e985cb upstream.

Drop support for virtualizing adaptive PEBS, as KVM's implementation is
architecturally broken without an obvious/easy path forward, and because
exposing adaptive PEBS can leak host LBRs to the guest, i.e. can leak
host kernel addresses to the guest.

Bug #1 is that KVM doesn't account for the upper 32 bits of
IA32_FIXED_CTR_CTRL when (re)programming fixed counters, e.g.
fixed_ctrl_field() drops the upper bits, reprogram_fixed_counters()
stores local variables as u8s and truncates the upper bits too, etc.

Bug #2 is that, because KVM _always_ sets precise_ip to a non-zero value
for PEBS events, perf will _always_ generate an adaptive record, even if
the guest requested a basic record.  Note, KVM will also enable adaptive
PEBS in individual *counter*, even if adaptive PEBS isn't exposed to the
guest, but this is benign as MSR_PEBS_DATA_CFG is guaranteed to be zero,
i.e. the guest will only ever see Basic records.

Bug #3 is in perf.  intel_pmu_disable_fixed() doesn't clear the upper
bits either, i.e. leaves ICL_FIXED_0_ADAPTIVE set, and
intel_pmu_enable_fixed() effectively doesn't clear ICL_FIXED_0_ADAPTIVE
either.  I.e. perf _always_ enables ADAPTIVE counters, regardless of what
KVM requests.

Bug #4 is that adaptive PEBS *might* effectively bypass event filters set
by the host, as "Updated Memory Access Info Group" records information
that might be disallowed by userspace via KVM_SET_PMU_EVENT_FILTER.

Bug #5 is that KVM doesn't ensure LBR MSRs hold guest values (or at least
zeros) when entering a vCPU with adaptive PEBS, which allows the guest
to read host LBRs, i.e. host RIPs/addresses, by enabling "LBR Entries"
records.

Disable adaptive PEBS support as an immediate fix due to the severity of
the LBR leak in particular, and because fixing all of the bugs will be
non-trivial, e.g. not suitable for backporting to stable kernels.

Note!  This will break live migration, but trying to make KVM play nice
with live migration would be quite complicated, wouldn't be guaranteed to
work (i.e. KVM might still kill/confuse the guest), and it's not clear
that there are any publicly available VMMs that support adaptive PEBS,
let alone live migrate VMs that support adaptive PEBS, e.g. QEMU doesn't
support PEBS in any capacity.

[Backport changes]

The upstream change was made in arch/x86/kvm/vmx/vmx.c. For backport
compatibility, it is instead applied to vmx_get_perf_capabilities() in
arch/x86/kvm/vmx/capabilities.h.
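
A sketch of what the backported hunk boils down to (abridged and
illustrative, not the exact code): mask the PEBS Baseline bit out of the
advertised perf capabilities so adaptive PEBS is never exposed, while basic
PEBS handling is untouched.

  static inline u64 vmx_get_perf_capabilities(void)
  {
          u64 perf_cap = PMU_CAP_FW_WRITES;

          /* ... LBR and PEBS capability probing elided ... */

          /* Never advertise adaptive PEBS (PEBS Baseline) to the guest. */
          perf_cap &= ~PERF_CAP_PEBS_BASELINE;

          return perf_cap;
  }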

Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/all/[email protected]
Fixes: c59a1f1 ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS")
Cc: [email protected]
Cc: Like Xu <[email protected]>
Cc: Mingwei Zhang <[email protected]>
Cc: Zhenyu Wang <[email protected]>
Cc: Zhang Xiong <[email protected]>
Cc: Lv Zhiyuan <[email protected]>
Cc: Dapeng Mi <[email protected]>
Cc: Jim Mattson <[email protected]>
Acked-by: Like Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 1c4dc57 upstream.

The braces around the KVM_CAP_XSAVE2 block also surround the
KVM_CAP_PMU_CAPABILITY block, likely the result of a merge issue. Simply
move the curly brace back to where it belongs.
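
For illustration, the shape of the bug looks roughly like the pseudo-excerpt
below (abridged, not the exact upstream code); it compiles because a case
label may legally sit inside a nested compound statement:

  case KVM_CAP_XSAVE2: {
          u64 guest_perm = xstate_get_guest_group_perm();

          r = xstate_required_size(supported_xcr0 & guest_perm, false);
          break;
  case KVM_CAP_PMU_CAPABILITY:    /* unintentionally inside the XSAVE2 braces */
          r = enable_pmu ? KVM_CAP_PMU_VALID_MASK : 0;
          break;
  }       /* the fix moves this closing brace up, after the first break */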

Fixes: ba7bb66 ("KVM: x86: Provide per VM capability for disabling PMU virtualization")

Reviewed-by: David Matlack <[email protected]>
Reviewed-by: Peter Xu <[email protected]>
Signed-off-by: Ben Gardon <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
…re limit

commit 48639df upstream.

A subsequent patch will need to acquire the CPUID leaf range for emulated
Xen so explicitly pass the signature of the hypervisor we're interested in
to the new function. Also introduce a new kvm_hypervisor_cpuid structure
so we can neatly store both the base and limit leaf indices.
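
Roughly, the new structure has the following shape (sketch based on the
description above):

  /* The hypervisor signature leaf index plus the limit reported in its EAX. */
  struct kvm_hypervisor_cpuid {
          u32 base;
          u32 limit;
  };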

Signed-off-by: Paul Durrant <[email protected]>
Reviewed-by: David Woodhouse <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 92e82cf upstream.

Similar to kvm_find_kvm_cpuid_features()/__kvm_find_kvm_cpuid_features(),
introduce a helper to search for the specific hypervisor signature in any
struct kvm_cpuid_entry2 array, not only in vcpu->arch.cpuid_entries.

No functional change intended.
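
An abridged kernel-style sketch of the helper's shape (illustrative, not the
verbatim upstream code; cpuid_entry2_find() and
for_each_possible_hypervisor_cpuid_base() are existing KVM helpers):

  static struct kvm_hypervisor_cpuid
  __kvm_get_hypervisor_cpuid(struct kvm_cpuid_entry2 *entries, int nent,
                             const char *sig)
  {
          struct kvm_hypervisor_cpuid cpuid = {};
          struct kvm_cpuid_entry2 *entry;
          u32 base;

          for_each_possible_hypervisor_cpuid_base(base) {
                  entry = cpuid_entry2_find(entries, nent, base,
                                            KVM_CPUID_INDEX_NOT_SIGNIFICANT);
                  /* ebx/ecx/edx hold the 12-byte hypervisor signature. */
                  if (entry && !memcmp(&entry->ebx, sig, 12)) {
                          cpuid.base  = base;
                          cpuid.limit = entry->eax;
                          break;
                  }
          }

          return cpuid;
  }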

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
commit 59cc99f upstream.

For the same purpose, the legacy intel_pmu_lbr_is_compatible() can be
renamed for reuse by more callers, and the comment about the LBR use case
can be deleted along the way.

Signed-off-by: Like Xu <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>