summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
8 daysMerge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "s390: - Lots of small and not-so-small fixes for the newly rewritten gmap, mostly affecting the handling of nested guests. x86: - Fix an issue with shadow paging, which causes KVM to install an MMIO PTE in the shadow page tables without first zapping a non-MMIO SPTE if KVM didn't see the write that modified the shadowed guest PTE. While commit a54aa15c6bda3 ("KVM: x86/mmu: Handle MMIO SPTEs directly in mmu_set_spte()") was right about it being impossible to miss such a write if it was coming from the guest, it failed to account for writes to guest memory that are outside the scope of KVM: if userspace modifies the guest PTE, and then the guest hits a relevant page fault, KVM will get confused" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86/mmu: Only WARN in direct MMUs when overwriting shadow-present SPTE KVM: x86/mmu: Drop/zap existing present SPTE even when creating an MMIO SPTE KVM: s390: Fix KVM_S390_VCPU_FAULT ioctl KVM: s390: vsie: Fix guest page tables protection KVM: s390: vsie: Fix unshadowing while shadowing KVM: s390: vsie: Fix refcount overflow for shadow gmaps KVM: s390: vsie: Fix nested guest memory shadowing KVM: s390: Correctly handle guest mappings without struct page KVM: s390: Fix gmap_link() KVM: s390: vsie: Fix check for pre-existing shadow mapping KVM: s390: Remove non-atomic dat_crstep_xchg() KVM: s390: vsie: Fix dat_split_ste()
8 daysMerge tag 'x86-urgent-2026-03-29' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: - Fix an early boot crash in AMD SEV-SNP guests, caused by incorrect FSGSBASE init ordering (Nikunj A Dadhania) - Remove X86_CR4_FRED from the CR4 pinned bits mask, to fix a race window during the bootup of SEV-{ES,SNP} or TDX guests, which can crash them if they trigger exceptions in that window (Borislav Petkov) - Fix early boot failures on SEV-ES/SNP guests, due to incorrect early GHCB access (Nikunj A Dadhania) - Add clarifying comment to the CRn pinning logic, to avoid future confusion & bugs (Peter Zijlstra) * tag 'x86-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu: Add comment clarifying CRn pinning x86/fred: Fix early boot failures on SEV-ES/SNP guests x86/cpu: Remove X86_CR4_FRED from the CR4 pinned bits mask x86/cpu: Enable FSGSBASE early in cpu_init_exception_handling()
10 daysMerge tag 'efi-fixes-for-v7.0-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI fix from Ard Biesheuvel: "Fix a potential buffer overrun issue introduced by the previous fix for EFI boot services region reservations on x86" * tag 'efi-fixes-for-v7.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: x86/efi: efi_unmap_boot_services: fix calculation of ranges_to_free size
10 daysKVM: x86/mmu: Only WARN in direct MMUs when overwriting shadow-present SPTESean Christopherson
Adjust KVM's sanity check against overwriting a shadow-present SPTE with a another SPTE with a different target PFN to only apply to direct MMUs, i.e. only to MMUs without shadowed gPTEs. While it's impossible for KVM to overwrite a shadow-present SPTE in response to a guest write, writes from outside the scope of KVM, e.g. from host userspace, aren't detected by KVM's write tracking and so can break KVM's shadow paging rules. ------------[ cut here ]------------ pfn != spte_to_pfn(*sptep) WARNING: arch/x86/kvm/mmu/mmu.c:3069 at mmu_set_spte+0x1e4/0x440 [kvm], CPU#0: vmx_ept_stale_r/872 Modules linked in: kvm_intel kvm irqbypass CPU: 0 UID: 1000 PID: 872 Comm: vmx_ept_stale_r Not tainted 7.0.0-rc2-eafebd2d2ab0-sink-vm #319 PREEMPT Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:mmu_set_spte+0x1e4/0x440 [kvm] Call Trace: <TASK> ept_page_fault+0x535/0x7f0 [kvm] kvm_mmu_do_page_fault+0xee/0x1f0 [kvm] kvm_mmu_page_fault+0x8d/0x620 [kvm] vmx_handle_exit+0x18c/0x5a0 [kvm_intel] kvm_arch_vcpu_ioctl_run+0xc55/0x1c20 [kvm] kvm_vcpu_ioctl+0x2d5/0x980 [kvm] __x64_sys_ioctl+0x8a/0xd0 do_syscall_64+0xb5/0x730 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> ---[ end trace 0000000000000000 ]--- Fixes: 11d45175111d ("KVM: x86/mmu: Warn if PFN changes on shadow-present SPTE in shadow MMU") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
10 daysKVM: x86/mmu: Drop/zap existing present SPTE even when creating an MMIO SPTESean Christopherson
When installing an emulated MMIO SPTE, do so *after* dropping/zapping the existing SPTE (if it's shadow-present). While commit a54aa15c6bda3 was right about it being impossible to convert a shadow-present SPTE to an MMIO SPTE due to a _guest_ write, it failed to account for writes to guest memory that are outside the scope of KVM. E.g. if host userspace modifies a shadowed gPTE to switch from a memslot to emulted MMIO and then the guest hits a relevant page fault, KVM will install the MMIO SPTE without first zapping the shadow-present SPTE. ------------[ cut here ]------------ is_shadow_present_pte(*sptep) WARNING: arch/x86/kvm/mmu/mmu.c:484 at mark_mmio_spte+0xb2/0xc0 [kvm], CPU#0: vmx_ept_stale_r/4292 Modules linked in: kvm_intel kvm irqbypass CPU: 0 UID: 1000 PID: 4292 Comm: vmx_ept_stale_r Not tainted 7.0.0-rc2-eafebd2d2ab0-sink-vm #319 PREEMPT Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:mark_mmio_spte+0xb2/0xc0 [kvm] Call Trace: <TASK> mmu_set_spte+0x237/0x440 [kvm] ept_page_fault+0x535/0x7f0 [kvm] kvm_mmu_do_page_fault+0xee/0x1f0 [kvm] kvm_mmu_page_fault+0x8d/0x620 [kvm] vmx_handle_exit+0x18c/0x5a0 [kvm_intel] kvm_arch_vcpu_ioctl_run+0xc55/0x1c20 [kvm] kvm_vcpu_ioctl+0x2d5/0x980 [kvm] __x64_sys_ioctl+0x8a/0xd0 do_syscall_64+0xb5/0x730 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x47fa3f </TASK> ---[ end trace 0000000000000000 ]--- Reported-by: Alexander Bulekov <bkov@amazon.com> Debugged-by: Alexander Bulekov <bkov@amazon.com> Suggested-by: Fred Griffoul <fgriffo@amazon.co.uk> Fixes: a54aa15c6bda3 ("KVM: x86/mmu: Handle MMIO SPTEs directly in mmu_set_spte()") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-23x86/cpu: Add comment clarifying CRn pinningPeter Zijlstra
To avoid future confusion on the purpose and design of the CRn pinning code. Also note that if the attacker controls page-tables, the CRn bits lose much of the attraction anyway. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://patch.msgid.link/20260320092521.GG3739106@noisy.programming.kicks-ass.net
2026-03-23x86/fred: Fix early boot failures on SEV-ES/SNP guestsNikunj A Dadhania
FRED-enabled SEV-(ES,SNP) guests fail to boot due to the following issues in the early boot sequence: * FRED does not have a #VC exception handler in the dispatch logic * Early FRED #VC exceptions attempt to use uninitialized per-CPU GHCBs instead of boot_ghcb Add X86_TRAP_VC case to fred_hwexc() with a new exc_vmm_communication() function that provides the unified entry point FRED requires, dispatching to existing user/kernel handlers based on privilege level. The function is already declared via DECLARE_IDTENTRY_VC(). Fix early GHCB access by falling back to boot_ghcb in __sev_{get,put}_ghcb() when per-CPU GHCBs are not yet initialized. Fixes: 14619d912b65 ("x86/fred: FRED entry/exit and dispatch code") Signed-off-by: Nikunj A Dadhania <nikunj@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Cc: <stable@kernel.org> # 6.12+ Link: https://patch.msgid.link/20260318075654.1792916-4-nikunj@amd.com
2026-03-23x86/cpu: Remove X86_CR4_FRED from the CR4 pinned bits maskBorislav Petkov (AMD)
Commit in Fixes added the FRED CR4 bit to the CR4 pinned bits mask so that whenever something else modifies CR4, that bit remains set. Which in itself is a perfectly fine idea. However, there's an issue when during boot FRED is initialized: first on the BSP and later on the APs. Thus, there's a window in time when exceptions cannot be handled. This becomes particularly nasty when running as SEV-{ES,SNP} or TDX guests which, when they manage to trigger exceptions during that short window described above, triple fault due to FRED MSRs not being set up yet. See Link tag below for a much more detailed explanation of the situation. So, as a result, the commit in that Link URL tried to address this shortcoming by temporarily disabling CR4 pinning when an AP is not online yet. However, that is a problem in itself because in this case, an attack on the kernel needs to only modify the online bit - a single bit in RW memory - and then disable CR4 pinning and then disable SM*P, leading to more and worse things to happen to the system. So, instead, remove the FRED bit from the CR4 pinning mask, thus obviating the need to temporarily disable CR4 pinning. If someone manages to disable FRED when poking at CR4, then idt_invalidate() would make sure the system would crash'n'burn on the first exception triggered, which is a much better outcome security-wise. Fixes: ff45746fbf00 ("x86/cpu: Add X86_CR4_FRED macro") Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org> # 6.12+ Link: https://lore.kernel.org/r/177385987098.1647592.3381141860481415647.tip-bot2@tip-bot2
2026-03-23x86/cpu: Enable FSGSBASE early in cpu_init_exception_handling()Nikunj A Dadhania
Move FSGSBASE enablement from identify_cpu() to cpu_init_exception_handling() to ensure it is enabled before any exceptions can occur on both boot and secondary CPUs. == Background == Exception entry code (paranoid_entry()) uses ALTERNATIVE patching based on X86_FEATURE_FSGSBASE to decide whether to use RDGSBASE/WRGSBASE instructions or the slower RDMSR/SWAPGS sequence for saving/restoring GSBASE. On boot CPU, ALTERNATIVE patching happens after enabling FSGSBASE in CR4. When the feature is available, the code is permanently patched to use RDGSBASE/WRGSBASE, which require CR4.FSGSBASE=1 to execute without triggering == Boot Sequence == Boot CPU (with CR pinning enabled): trap_init() cpu_init() <- Uses unpatched code (RDMSR/SWAPGS) x2apic_setup() ... arch_cpu_finalize_init() identify_boot_cpu() identify_cpu() cr4_set_bits(X86_CR4_FSGSBASE) # Enables the feature # This becomes part of cr4_pinned_bits ... alternative_instructions() <- Patches code to use RDGSBASE/WRGSBASE Secondary CPUs (with CR pinning enabled): start_secondary() cr4_init() <- Code already patched, CR4.FSGSBASE=1 set implicitly via cr4_pinned_bits cpu_init() <- exceptions work because FSGSBASE is already enabled Secondary CPU (with CR pinning disabled): start_secondary() cr4_init() <- Code already patched, CR4.FSGSBASE=0 cpu_init() x2apic_setup() rdmsrq(MSR_IA32_APICBASE) <- Triggers #VC in SNP guests exc_vmm_communication() paranoid_entry() <- Uses RDGSBASE with CR4.FSGSBASE=0 (patched code) ... ap_starting() identify_secondary_cpu() identify_cpu() cr4_set_bits(X86_CR4_FSGSBASE) <- Enables the feature, which is too late == CR Pinning == Currently, for secondary CPUs, CR4.FSGSBASE is set implicitly through CR-pinning: the boot CPU sets it during identify_cpu(), it becomes part of cr4_pinned_bits, and cr4_init() applies those pinned bits to secondary CPUs. This works but creates an undocumented dependency between cr4_init() and the pinning mechanism. == Problem == Secondary CPUs boot after alternatives have been applied globally. They execute already-patched paranoid_entry() code that uses RDGSBASE/WRGSBASE instructions, which require CR4.FSGSBASE=1. Upcoming changes to CR pinning behavior will break the implicit dependency, causing secondary CPUs to generate #UD. This issue manifests itself on AMD SEV-SNP guests, where the rdmsrq() in x2apic_setup() triggers a #VC exception early during cpu_init(). The #VC handler (exc_vmm_communication()) executes the patched paranoid_entry() path. Without CR4.FSGSBASE enabled, RDGSBASE instructions trigger #UD. == Fix == Enable FSGSBASE explicitly in cpu_init_exception_handling() before loading exception handlers. This makes the dependency explicit and ensures both boot and secondary CPUs have FSGSBASE enabled before paranoid_entry() executes. Fixes: c82965f9e530 ("x86/entry/64: Handle FSGSBASE enabled paranoid entry/exit") Reported-by: Borislav Petkov <bp@alien8.de> Suggested-by: Sohil Mehta <sohil.mehta@intel.com> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com> Cc: <stable@kernel.org> Link: https://patch.msgid.link/20260318075654.1792916-2-nikunj@amd.com
2026-03-22Merge tag 'x86-urgent-2026-03-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: - Improve Qemu MCE-injection behavior by only using AMD SMCA MSRs if the feature bit is set - Fix the relative path of gettimeofday.c inclusion in vclock_gettime.c - Fix a boot crash on UV clusters when a socket is marked as 'deconfigured' which are mapped to the SOCK_EMPTY node ID by the UV firmware, while Linux APIs expect NUMA_NO_NODE. The difference being (0xffff [unsigned short ~0]) vs [int -1] * tag 'x86-urgent-2026-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/platform/uv: Handle deconfigured sockets x86/entry/vdso: Fix path of included gettimeofday.c x86/mce/amd: Check SMCA feature bit before accessing SMCA MSRs
2026-03-22Merge tag 'perf-urgent-2026-03-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: - Fix a PMU driver crash on AMD EPYC systems, caused by a race condition in x86_pmu_enable() - Fix a possible counter-initialization bug in x86_pmu_enable() - Fix a counter inheritance bug in inherit_event() and __perf_event_read() - Fix an Intel PMU driver branch constraints handling bug found by UBSAN - Fix the Intel PMU driver's new Off-Module Response (OMR) support code for Diamond Rapids / Nova lake, to fix a snoop information parsing bug * tag 'perf-urgent-2026-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel: Fix OMR snoop information parsing issues perf/x86/intel: Add missing branch counters constraint apply perf: Make sure to use pmu_ctx->pmu for groups x86/perf: Make sure to program the counter value for stopped events on migration perf/x86: Move event pointer setup earlier in x86_pmu_enable()
2026-03-20x86/platform/uv: Handle deconfigured socketsKyle Meyer
When a socket is deconfigured, it's mapped to SOCK_EMPTY (0xffff). This causes a panic while allocating UV hub info structures. Fix this by using NUMA_NO_NODE, allowing UV hub info structures to be allocated on valid nodes. Fixes: 8a50c5851927 ("x86/platform/uv: UV support for sub-NUMA clustering") Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Steve Wahl <steve.wahl@hpe.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/ab2BmGL0ehVkkjKk@hpe.com
2026-03-20x86/entry/vdso: Fix path of included gettimeofday.cVladimir Oltean
Commit in Fixes forgot to convert one include path to be relative to the kernel source directory after adding latter to flags-y. Fix it. [ bp: Rewrite commit message. ] Fixes: 693c819fedcd ("x86/entry/vdso: Refactor the vdso build") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20260307174406.1808981-1-vladimir.oltean@nxp.com
2026-03-20Merge tag 'hyperv-fixes-signed-20260319' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull Hyper-V fixes from Wei Liu: - Fix ARM64 MSHV support (Anirudh Rayabharam) - Fix MSHV driver memory handling issues (Stanislav Kinsburskii) - Update maintainers for Hyper-V DRM driver (Saurabh Sengar) - Misc clean up in MSHV crashdump code (Ard Biesheuvel, Uros Bizjak) - Minor improvements to MSHV code (Mukesh R, Wei Liu) - Revert not yet released MSHV scrub partition hypercall (Wei Liu) * tag 'hyperv-fixes-signed-20260319' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: mshv: Fix error handling in mshv_region_pin MAINTAINERS: Update maintainers for Hyper-V DRM driver mshv: Fix use-after-free in mshv_map_user_memory error path mshv: pass struct mshv_user_mem_region by reference x86/hyperv: Use any general-purpose register when saving %cr2 and %cr8 x86/hyperv: Use current_stack_pointer to avoid asm() in hv_hvcrash_ctxt_save() x86/hyperv: Save segment registers directly to memory in hv_hvcrash_ctxt_save() x86/hyperv: Use __naked attribute to fix stackless C function Revert "mshv: expose the scrub partition hypercall" mshv: add arm64 support for doorbell & intercept SINTs mshv: refactor synic init and cleanup x86/hyperv: print out reserved vectors in hexadecimal
2026-03-20x86/efi: efi_unmap_boot_services: fix calculation of ranges_to_free sizeMike Rapoport (Microsoft)
ranges_to_free array should have enough room to store the entire EFI memmap plus an extra element for NULL entry. The calculation of this array size wrongly adds 1 to the overall size instead of adding 1 to the number of elements. Add parentheses to properly size the array. Reported-by: Guenter Roeck <linux@roeck-us.net> Fixes: a4b0bf6a40f3 ("x86/efi: defer freeing of boot services memory") Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2026-03-18x86/mce/amd: Check SMCA feature bit before accessing SMCA MSRsWilliam Roche
People do effort to inject MCEs into guests in order to simulate/test handling of hardware errors. The real use case behind it is testing the handling of SIGBUS which the memory failure code sends to the process. If that process is QEMU, instead of killing the whole guest, the MCE can be injected into the guest kernel so that latter can attempt proper handling and kill the user *process* in the guest, instead, which caused the MCE. The assumption being here that the whole injection flow can supply enough information that the guest kernel can pinpoint the right process. But that's a different topic... Regardless of virtualization or not, access to SMCA-specific registers like MCA_DESTAT should only be done after having checked the smca feature bit. And there are AMD machines like Bulldozer (the one before Zen1) which do support deferred errors but are not SMCA machines. Therefore, properly check the feature bit before accessing related MSRs. [ bp: Rewrite commit message. ] Fixes: 7cb735d7c0cb ("x86/mce: Unify AMD DFR handler with MCA Polling") Signed-off-by: William Roche <william.roche@oracle.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20260218163025.1316501-1-william.roche@oracle.com
2026-03-15Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "Quite a large pull request, partly due to skipping last week and therefore having material from ~all submaintainers in this one. About a fourth of it is a new selftest, and a couple more changes are large in number of files touched (fixing a -Wflex-array-member-not-at-end compiler warning) or lines changed (reformatting of a table in the API documentation, thanks rST). But who am I kidding---it's a lot of commits and there are a lot of bugs being fixed here, some of them on the nastier side like the RISC-V ones. ARM: - Correctly handle deactivation of interrupts that were activated from LRs. Since EOIcount only denotes deactivation of interrupts that are not present in an LR, start EOIcount deactivation walk *after* the last irq that made it into an LR - Avoid calling into the stubs to probe for ICH_VTR_EL2.TDS when pKVM is already enabled -- not only thhis isn't possible (pKVM will reject the call), but it is also useless: this can only happen for a CPU that has already booted once, and the capability will not change - Fix a couple of low-severity bugs in our S2 fault handling path, affecting the recently introduced LS64 handling and the even more esoteric handling of hwpoison in a nested context - Address yet another syzkaller finding in the vgic initialisation, where we would end-up destroying an uninitialised vgic with nasty consequences - Address an annoying case of pKVM failing to boot when some of the memblock regions that the host is faulting in are not page-aligned - Inject some sanity in the NV stage-2 walker by checking the limits against the advertised PA size, and correctly report the resulting faults PPC: - Fix a PPC e500 build error due to a long-standing wart that was exposed by the recent conversion to kmalloc_obj(); rip out all the ugliness that led to the wart RISC-V: - Prevent speculative out-of-bounds access using array_index_nospec() in APLIC interrupt handling, ONE_REG regiser access, AIA CSR access, float register access, and PMU counter access - Fix potential use-after-free issues in kvm_riscv_gstage_get_leaf(), kvm_riscv_aia_aplic_has_attr(), and kvm_riscv_aia_imsic_has_attr() - Fix potential null pointer dereference in kvm_riscv_vcpu_aia_rmw_topei() - Fix off-by-one array access in SBI PMU - Skip THP support check during dirty logging - Fix error code returned for Smstateen and Ssaia ONE_REG interface - Check host Ssaia extension when creating AIA irqchip x86: - Fix cases where CPUID mitigation features were incorrectly marked as available whenever the kernel used scattered feature words for them - Validate _all_ GVAs, rather than just the first GVA, when processing a range of GVAs for Hyper-V's TLB flush hypercalls - Fix a brown paper bug in add_atomic_switch_msr() - Use hlist_for_each_entry_srcu() when traversing mask_notifier_list, to fix a lockdep warning; KVM doesn't hold RCU, just irq_srcu - Ensure AVIC VMCB fields are initialized if the VM has an in-kernel local APIC (and AVIC is enabled at the module level) - Update CR8 write interception when AVIC is (de)activated, to fix a bug where the guest can run in perpetuity with the CR8 intercept enabled - Add a quirk to skip the consistency check on FREEZE_IN_SMM, i.e. to allow L1 hypervisors to set FREEZE_IN_SMM. This reverts (by default) an unintentional tightening of userspace ABI in 6.17, and provides some amount of backwards compatibility with hypervisors who want to freeze PMCs on VM-Entry - Validate the VMCS/VMCB on return to a nested guest from SMM, because either userspace or the guest could stash invalid values in memory and trigger the processor's consistency checks Generic: - Remove a subtle pseudo-overlay of kvm_stats_desc, which, aside from being unnecessary and confusing, triggered compiler warnings due to -Wflex-array-member-not-at-end - Document that vcpu->mutex is take outside of kvm->slots_lock and kvm->slots_arch_lock, which is intentional and desirable despite being rather unintuitive Selftests: - Increase the maximum number of NUMA nodes in the guest_memfd selftest to 64 (from 8)" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (43 commits) KVM: selftests: Verify SEV+ guests can read and write EFER, CR0, CR4, and CR8 Documentation: kvm: fix formatting of the quirks table KVM: x86: clarify leave_smm() return value selftests: kvm: add a test that VMX validates controls on RSM selftests: kvm: extract common functionality out of smm_test.c KVM: SVM: check validity of VMCB controls when returning from SMM KVM: VMX: check validity of VMCS controls when returning from SMM KVM: SVM: Set/clear CR8 write interception when AVIC is (de)activated KVM: SVM: Initialize AVIC VMCB fields if AVIC is enabled with in-kernel APIC KVM: x86: Introduce KVM_X86_QUIRK_VMCS12_ALLOW_FREEZE_IN_SMM KVM: x86: Fix SRCU list traversal in kvm_fire_mask_notifiers() KVM: VMX: Fix a wrong MSR update in add_atomic_switch_msr() KVM: x86: hyper-v: Validate all GVAs during PV TLB flush KVM: x86: synthesize CPUID bits only if CPU capability is set KVM: PPC: e500: Rip out "struct tlbe_ref" KVM: PPC: e500: Fix build error due to using kmalloc_obj() with wrong type KVM: selftests: Increase 'maxnode' for guest_memfd tests KVM: arm64: pkvm: Don't reprobe for ICH_VTR_EL2.TDS on CPU hotplug KVM: arm64: vgic: Pick EOIcount deactivations from AP-list tail KVM: arm64: Remove the redundant ISB in __kvm_at_s1e2() ...
2026-03-12perf/x86/intel: Fix OMR snoop information parsing issuesDapeng Mi
When omr_source is 0x2, the omr_snoop (bit[6]) and omr_promoted (bit[7]) fields are combined to represent the snoop information. However, the omr_promoted field was not left-shifted by 1 bit, resulting in incorrect snoop information. Besides, the snoop information parsing is not accurate for some OMR sources, like the snoop information should be SNOOP_NONE for these memory access (omr_source >= 7) instead of SNOOP_HIT. Fix these issues. Closes: https://lore.kernel.org/all/CAP-5=fW4zLWFw1v38zCzB9-cseNSTTCtup=p2SDxZq7dPayVww@mail.gmail.com/ Fixes: d2bdcde9626c ("perf/x86/intel: Add support for PEBS memory auxiliary info field in DMR") Reported-by: Ian Rogers <irogers@google.com> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ian Rogers <irogers@google.com> Link: https://patch.msgid.link/20260311075201.2951073-1-dapeng1.mi@linux.intel.com
2026-03-12perf/x86/intel: Add missing branch counters constraint applyDapeng Mi
When running the command: 'perf record -e "{instructions,instructions:p}" -j any,counter sleep 1', a "shift-out-of-bounds" warning is reported on CWF. UBSAN: shift-out-of-bounds in /kbuild/src/consumer/arch/x86/events/intel/lbr.c:970:15 shift exponent 64 is too large for 64-bit type 'long long unsigned int' ...... intel_pmu_lbr_counters_reorder.isra.0.cold+0x2a/0xa7 intel_pmu_lbr_save_brstack+0xc0/0x4c0 setup_arch_pebs_sample_data+0x114b/0x2400 The warning occurs because the second "instructions:p" event, which involves branch counters sampling, is incorrectly programmed to fixed counter 0 instead of the general-purpose (GP) counters 0-3 that support branch counters sampling. Currently only GP counters 0-3 support branch counters sampling on CWF, any event involving branch counters sampling should be programed on GP counters 0-3. Since the counter index of fixed counter 0 is 32, it leads to the "src" value in below code is right shifted 64 bits and trigger the "shift-out-of-bounds" warning. cnt = (src >> (order[j] * LBR_INFO_BR_CNTR_BITS)) & LBR_INFO_BR_CNTR_MASK; The root cause is the loss of the branch counters constraint for the new event in the branch counters sampling event group. Since it isn't yet part of the sibling list. This results in the second "instructions:p" event being programmed on fixed counter 0 incorrectly instead of the appropriate GP counters 0-3. To address this, we apply the missing branch counters constraint for the last event in the group. Additionally, we introduce a new function, `intel_set_branch_counter_constr()`, to apply the branch counters constraint and avoid code duplication. Fixes: 33744916196b ("perf/x86/intel: Support branch counters logging") Reported-by: Xudong Hao <xudong.hao@intel.com> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260228053320.140406-2-dapeng1.mi@linux.intel.com Cc: stable@vger.kernel.org
2026-03-12x86/perf: Make sure to program the counter value for stopped events on migrationPeter Zijlstra
Both Mi Dapeng and Ian Rogers noted that not everything that sets HES_STOPPED is required to EF_UPDATE. Specifically the 'step 1' loop of rescheduling explicitly does EF_UPDATE to ensure the counter value is read. However, then 'step 2' simply leaves the new counter uninitialized when HES_STOPPED, even though, as noted above, the thing that stopped them might not be aware it needs to EF_RELOAD -- since it didn't EF_UPDATE on stop. One such location that is affected is throttling, throttle does pmu->stop(, 0); and unthrottle does pmu->start(, 0); possibly restarting an uninitialized counter. Fixes: a4eaf7f14675 ("perf: Rework the PMU methods") Reported-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Reported-by: Ian Rogers <irogers@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://patch.msgid.link/20260311204035.GX606826@noisy.programming.kicks-ass.net
2026-03-12perf/x86: Move event pointer setup earlier in x86_pmu_enable()Breno Leitao
A production AMD EPYC system crashed with a NULL pointer dereference in the PMU NMI handler: BUG: kernel NULL pointer dereference, address: 0000000000000198 RIP: x86_perf_event_update+0xc/0xa0 Call Trace: <NMI> amd_pmu_v2_handle_irq+0x1a6/0x390 perf_event_nmi_handler+0x24/0x40 The faulting instruction is `cmpq $0x0, 0x198(%rdi)` with RDI=0, corresponding to the `if (unlikely(!hwc->event_base))` check in x86_perf_event_update() where hwc = &event->hw and event is NULL. drgn inspection of the vmcore on CPU 106 showed a mismatch between cpuc->active_mask and cpuc->events[]: active_mask: 0x1e (bits 1, 2, 3, 4) events[1]: 0xff1100136cbd4f38 (valid) events[2]: 0x0 (NULL, but active_mask bit 2 set) events[3]: 0xff1100076fd2cf38 (valid) events[4]: 0xff1100079e990a90 (valid) The event that should occupy events[2] was found in event_list[2] with hw.idx=2 and hw.state=0x0, confirming x86_pmu_start() had run (which clears hw.state and sets active_mask) but events[2] was never populated. Another event (event_list[0]) had hw.state=0x7 (STOPPED|UPTODATE|ARCH), showing it was stopped when the PMU rescheduled events, confirming the throttle-then-reschedule sequence occurred. The root cause is commit 7e772a93eb61 ("perf/x86: Fix NULL event access and potential PEBS record loss") which moved the cpuc->events[idx] assignment out of x86_pmu_start() and into step 2 of x86_pmu_enable(), after the PERF_HES_ARCH check. This broke any path that calls pmu->start() without going through x86_pmu_enable() -- specifically the unthrottle path: perf_adjust_freq_unthr_events() -> perf_event_unthrottle_group() -> perf_event_unthrottle() -> event->pmu->start(event, 0) -> x86_pmu_start() // sets active_mask but not events[] The race sequence is: 1. A group of perf events overflows, triggering group throttle via perf_event_throttle_group(). All events are stopped: active_mask bits cleared, events[] preserved (x86_pmu_stop no longer clears events[] after commit 7e772a93eb61). 2. While still throttled (PERF_HES_STOPPED), x86_pmu_enable() runs due to other scheduling activity. Stopped events that need to move counters get PERF_HES_ARCH set and events[old_idx] cleared. In step 2 of x86_pmu_enable(), PERF_HES_ARCH causes these events to be skipped -- events[new_idx] is never set. 3. The timer tick unthrottles the group via pmu->start(). Since commit 7e772a93eb61 removed the events[] assignment from x86_pmu_start(), active_mask[new_idx] is set but events[new_idx] remains NULL. 4. A PMC overflow NMI fires. The handler iterates active counters, finds active_mask[2] set, reads events[2] which is NULL, and crashes dereferencing it. Move the cpuc->events[hwc->idx] assignment in x86_pmu_enable() to before the PERF_HES_ARCH check, so that events[] is populated even for events that are not immediately started. This ensures the unthrottle path via pmu->start() always finds a valid event pointer. Fixes: 7e772a93eb61 ("perf/x86: Fix NULL event access and potential PEBS record loss") Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260310-perf-v2-1-4a3156fce43c@debian.org
2026-03-12x86/hyperv: Use any general-purpose register when saving %cr2 and %cr8Uros Bizjak
hv_hvcrash_ctxt_save() in arch/x86/hyperv/hv_crash.c currently saves %cr2 and %cr8 using %eax ("=a"). This unnecessarily forces a specific register. Update the inline assembly to use a general-purpose register ("=r") for both %cr2 and %cr8. This makes the code more flexible for the compiler while producing the same saved context contents. No functional changes. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Wei Liu <wei.liu@kernel.org> Cc: Dexuan Cui <decui@microsoft.com> Cc: Long Li <longli@microsoft.com> Cc: Thomas Gleixner <tglx@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2026-03-12x86/hyperv: Use current_stack_pointer to avoid asm() in hv_hvcrash_ctxt_save()Uros Bizjak
Use current_stack_pointer to avoid asm() when saving %rsp to the crash context memory in hv_hvcrash_ctxt_save(). The new code is more readable and results in exactly the same object file. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Wei Liu <wei.liu@kernel.org> Cc: Dexuan Cui <decui@microsoft.com> Cc: Long Li <longli@microsoft.com> Cc: Thomas Gleixner <tglx@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2026-03-12x86/hyperv: Save segment registers directly to memory in hv_hvcrash_ctxt_save()Uros Bizjak
hv_hvcrash_ctxt_save() in arch/x86/hyperv/hv_crash.c currently saves segment registers via a general-purpose register (%eax). Update the code to save segment registers (cs, ss, ds, es, fs, gs) directly to the crash context memory using movw. This avoids unnecessary use of a general-purpose register, making the code simpler and more efficient. The size of the corresponding object file improves as follows: text data bss dec hex filename 4167 176 200 4543 11bf hv_crash-old.o 4151 176 200 4527 11af hv_crash-new.o No functional change occurs to the saved context contents; this is purely a code-quality improvement. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Wei Liu <wei.liu@kernel.org> Cc: Dexuan Cui <decui@microsoft.com> Cc: Long Li <longli@microsoft.com> Cc: Thomas Gleixner <tglx@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2026-03-11KVM: x86: clarify leave_smm() return valuePaolo Bonzini
The return value of vmx_leave_smm() is unrelated from that of nested_vmx_enter_non_root_mode(). Check explicitly for success (which happens to be 0) and return 1 just like everywhere else in vmx_leave_smm(). Likewise, in svm_leave_smm() return 0/1 instead of the 0/1/-errno returned by tenter_svm_guest_mode(). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-03-11KVM: SVM: check validity of VMCB controls when returning from SMMPaolo Bonzini
The VMCB12 is stored in guest memory and can be mangled while in SMM; it is then reloaded by svm_leave_smm(), but it is not checked again for validity. Move the cached vmcb12 control and save consistency checks out of svm_set_nested_state() and into a helper, and reuse it in svm_leave_smm(). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-03-11KVM: VMX: check validity of VMCS controls when returning from SMMPaolo Bonzini
The VMCS12 is not available while in SMM. However, it can be overwritten if userspace manages to trigger copy_enlightened_to_vmcs12() - for example via KVM_GET_NESTED_STATE. Because of this, the VMCS12 has to be checked for validity before it is used to generate the VMCS02. Move the check code out of vmx_set_nested_state() (the other "not a VMLAUNCH/VMRESUME" path that emulates a nested vmentry) and reuse it in vmx_leave_smm(). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-03-11KVM: SVM: Set/clear CR8 write interception when AVIC is (de)activatedSean Christopherson
Explicitly set/clear CR8 write interception when AVIC is (de)activated to fix a bug where KVM leaves the interception enabled after AVIC is activated. E.g. if KVM emulates INIT=>WFS while AVIC is deactivated, CR8 will remain intercepted in perpetuity. On its own, the dangling CR8 intercept is "just" a performance issue, but combined with the TPR sync bug fixed by commit d02e48830e3f ("KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR even if AVIC is active"), the danging intercept is fatal to Windows guests as the TPR seen by hardware gets wildly out of sync with reality. Note, VMX isn't affected by the bug as TPR_THRESHOLD is explicitly ignored when Virtual Interrupt Delivery is enabled, i.e. when APICv is active in KVM's world. I.e. there's no need to trigger update_cr8_intercept(), this is firmly an SVM implementation flaw/detail. WARN if KVM gets a CR8 write #VMEXIT while AVIC is active, as KVM should never enter the guest with AVIC enabled and CR8 writes intercepted. Fixes: 3bbf3565f48c ("svm: Do not intercept CR8 when enable AVIC") Cc: stable@vger.kernel.org Cc: Jim Mattson <jmattson@google.com> Cc: Naveen N Rao (AMD) <naveen@kernel.org> Cc: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://patch.msgid.link/20260203190711.458413-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> [Squash fix to avic_deactivate_vmcb. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-03-11KVM: SVM: Initialize AVIC VMCB fields if AVIC is enabled with in-kernel APICSean Christopherson
Initialize all per-vCPU AVIC control fields in the VMCB if AVIC is enabled in KVM and the VM has an in-kernel local APIC, i.e. if it's _possible_ the vCPU could activate AVIC at any point in its lifecycle. Configuring the VMCB if and only if AVIC is active "works" purely because of optimizations in kvm_create_lapic() to speculatively set apicv_active if AVIC is enabled *and* to defer updates until the first KVM_RUN. In quotes because KVM likely won't do the right thing if kvm_apicv_activated() is false, i.e. if a vCPU is created while APICv is inhibited at the VM level for whatever reason. E.g. if the inhibit is *removed* before KVM_REQ_APICV_UPDATE is handled in KVM_RUN, then __kvm_vcpu_update_apicv() will elide calls to vendor code due to seeing "apicv_active == activate". Cleaning up the initialization code will also allow fixing a bug where KVM incorrectly leaves CR8 interception enabled when AVIC is activated without creating a mess with respect to whether AVIC is activated or not. Cc: stable@vger.kernel.org Fixes: 67034bb9dd5e ("KVM: SVM: Add irqchip_split() checks before enabling AVIC") Fixes: 6c3e4422dd20 ("svm: Add support for dynamic APICv") Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://patch.msgid.link/20260203190711.458413-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-03-11KVM: x86: Introduce KVM_X86_QUIRK_VMCS12_ALLOW_FREEZE_IN_SMMJim Mattson
Add KVM_X86_QUIRK_VMCS12_ALLOW_FREEZE_IN_SMM to allow L1 to set FREEZE_IN_SMM in vmcs12's GUEST_IA32_DEBUGCTL field, as permitted prior to commit 6b1dd26544d0 ("KVM: VMX: Preserve host's DEBUGCTLMSR_FREEZE_IN_SMM while running the guest"). Enable the quirk by default for backwards compatibility (like all quirks); userspace can disable it via KVM_CAP_DISABLE_QUIRKS2 for consistency with the constraints on WRMSR(IA32_DEBUGCTL). Note that the quirk only bypasses the consistency check. The vmcs02 bit is still owned by the host, and PMCs are not frozen during virtualized SMM. In particular, if a host administrator decides that PMCs should not be frozen during physical SMM, then L1 has no say in the matter. Fixes: 095686e6fcb4 ("KVM: nVMX: Check vmcs12->guest_ia32_debugctl on nested VM-Enter") Cc: stable@vger.kernel.org Signed-off-by: Jim Mattson <jmattson@google.com> Link: https://patch.msgid.link/20260205231537.1278753-1-jmattson@google.com [sean: tag for stable@, clean-up and fix goofs in the comment and docs] Signed-off-by: Sean Christopherson <seanjc@google.com> [Rename quirk. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-03-11KVM: x86: Fix SRCU list traversal in kvm_fire_mask_notifiers()Li RongQing
The mask_notifier_list is protected by kvm->irq_srcu, but the traversal in kvm_fire_mask_notifiers() incorrectly uses hlist_for_each_entry_rcu(). This leads to lockdep warnings because the standard RCU iterator expects to be under rcu_read_lock(), not SRCU. Replace the RCU variant with hlist_for_each_entry_srcu() and provide the proper srcu_read_lock_held() annotation to ensure correct synchronization and silence lockdep. Signed-off-by: Li RongQing <lirongqing@baidu.com> Link: https://patch.msgid.link/20260204091206.2617-1-lirongqing@baidu.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-03-11KVM: VMX: Fix a wrong MSR update in add_atomic_switch_msr()Namhyung Kim
The previous change had a bug to update a guest MSR with a host value. Fixes: c3d6a7210a4de9096 ("KVM: VMX: Dedup code for adding MSR to VMCS's auto list") Signed-off-by: Namhyung Kim <namhyung@kernel.org> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://patch.msgid.link/20260220220216.389475-1-namhyung@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-03-11KVM: x86: hyper-v: Validate all GVAs during PV TLB flushManuel Andreas
In KVM guests with Hyper-V hypercalls enabled, the hypercalls HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST and HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX allow a guest to request invalidation of portions of a virtual TLB. For this, the hypercall parameter includes a list of GVAs that are supposed to be invalidated. Currently, only the base GVA is checked to be canonical. In reality, this check needs to be performed for the entire range of GVAs, as checking only the base GVA enables guests running on Intel hardware to trigger a WARN_ONCE in the host (see Fixes commit below). Move the check for non-canonical addresses to be performed for every GVA of the supplied range to avoid the splat, and to be more in line with the Hyper-V specification, since, although unlikely, a range starting with an invalid GVA may still contain GVAs that are valid. Fixes: fa787ac07b3c ("KVM: x86/hyper-v: Skip non-canonical addresses during PV TLB flush") Signed-off-by: Manuel Andreas <manuel.andreas@tum.de> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://patch.msgid.link/00a7a31b-573b-4d92-91f8-7d7e2f88ea48@tum.de [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-03-11KVM: x86: synthesize CPUID bits only if CPU capability is setCarlos López
KVM incorrectly synthesizes CPUID bits for KVM-only leaves, as the following branch in kvm_cpu_cap_init() is never taken: if (leaf < NCAPINTS) kvm_cpu_caps[leaf] &= kernel_cpu_caps[leaf]; This means that bits set via SYNTHESIZED_F() for KVM-only leaves are unconditionally set. This for example can cause issues for SEV-SNP guests running on Family 19h CPUs, as TSA_SQ_NO and TSA_L1_NO are always enabled by KVM in 80000021[ECX]. When userspace issues a SNP_LAUNCH_UPDATE command to update the CPUID page for the guest, SNP firmware will explicitly reject the command if the page sets sets these bits on vulnerable CPUs. To fix this, check in SYNTHESIZED_F() that the corresponding X86 capability is set before adding it to to kvm_cpu_cap_features. Fixes: 31272abd5974 ("KVM: SVM: Advertise TSA CPUID bits to guests") Link: https://lore.kernel.org/all/20260208164233.30405-1-clopez@suse.de/ Signed-off-by: Carlos López <clopez@suse.de> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Link: https://patch.msgid.link/20260209153108.70667-2-clopez@suse.de Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-03-11Merge tag 'kvm-x86-generic-7.0-rc3' of https://github.com/kvm-x86/linux into ↵Paolo Bonzini
HEAD KVM generic changes for 7.0 - Remove a subtle pseudo-overlay of kvm_stats_desc, which, aside from being unnecessary and confusing, triggered compiler warnings due to -Wflex-array-member-not-at-end. - Document that vcpu->mutex is take outside of kvm->slots_lock and kvm->slots_arch_lock, which is intentional and desirable despite being rather unintuitive.
2026-03-11x86/hyperv: Use __naked attribute to fix stackless C functionArd Biesheuvel
hv_crash_c_entry() is a C function that is entered without a stack, and this is only allowed for functions that have the __naked attribute, which informs the compiler that it must not emit the usual prologue and epilogue or emit any other kind of instrumentation that relies on a stack frame. So split up the function, and set the __naked attribute on the initial part that sets up the stack, GDT, IDT and other pieces that are needed for ordinary C execution. Given that function calls are not permitted either, use the existing long return coded in an asm() block to call the second part of the function, which is an ordinary function that is permitted to call other functions as usual. Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> # asm parts, not hv parts Reviewed-by: Mukesh Rathor <mrathor@linux.microsoft.com> Acked-by: Uros Bizjak <ubizjak@gmail.com> Cc: Wei Liu <wei.liu@kernel.org> Cc: linux-hyperv@vger.kernel.org Fixes: 94212d34618c ("x86/hyperv: Implement hypervisor RAM collection into vmcore") Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2026-03-10x86/apic: Disable x2apic on resume if the kernel expects soShashank Balaji
When resuming from s2ram, firmware may re-enable x2apic mode, which may have been disabled by the kernel during boot either because it doesn't support IRQ remapping or for other reasons. This causes the kernel to continue using the xapic interface, while the hardware is in x2apic mode, which causes hangs. This happens on defconfig + bare metal + s2ram. Fix this in lapic_resume() by disabling x2apic if the kernel expects it to be disabled, i.e. when x2apic_mode = 0. The ACPI v6.6 spec, Section 16.3 [1] says firmware restores either the pre-sleep configuration or initial boot configuration for each CPU, including MSR state: When executing from the power-on reset vector as a result of waking from an S2 or S3 sleep state, the platform firmware performs only the hardware initialization required to restore the system to either the state the platform was in prior to the initial operating system boot, or to the pre-sleep configuration state. In multiprocessor systems, non-boot processors should be placed in the same state as prior to the initial operating system boot. (further ahead) If this is an S2 or S3 wake, then the platform runtime firmware restores minimum context of the system before jumping to the waking vector. This includes: CPU configuration. Platform runtime firmware restores the pre-sleep configuration or initial boot configuration of each CPU (MSR, MTRR, firmware update, SMBase, and so on). Interrupts must be disabled (for IA-32 processors, disabled by CLI instruction). (and other things) So at least as per the spec, re-enablement of x2apic by the firmware is allowed if "x2apic on" is a part of the initial boot configuration. [1] https://uefi.org/specs/ACPI/6.6/16_Waking_and_Sleeping.html#initialization [ bp: Massage. ] Fixes: 6e1cb38a2aef ("x64, x2apic/intr-remap: add x2apic support, including enabling interrupt-remapping") Co-developed-by: Rahul Bukte <rahul.bukte@sony.com> Signed-off-by: Rahul Bukte <rahul.bukte@sony.com> Signed-off-by: Shashank Balaji <shashank.mahadasyam@sony.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260306-x2apic-fix-v2-1-bee99c12efa3@sony.com
2026-03-08Merge tag 'efi-fixes-for-v7.0-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI fix from Ard Biesheuvel: "Fix for the x86 EFI workaround keeping boot services code and data regions reserved until after SetVirtualAddressMap() completes: deferred struct page initialization may result in some of this memory being lost permanently" * tag 'efi-fixes-for-v7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: x86/efi: defer freeing of boot services memory
2026-03-07Merge tag 'x86-urgent-2026-03-08' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: - Fix SEV guest boot failures in certain circumstances, due to very early code relying on a BSS-zeroed variable that isn't actually zeroed yet an may contain non-zero bootup values Move the variable into the .data section go gain even earlier zeroing - Expose & allow the IBPB-on-Entry feature on SNP guests, which was not properly exposed to guests due to initial implementational caution - Fix O= build failure when CONFIG_EFI_SBAT_FILE is using relative file paths - Fix the various SNC (Sub-NUMA Clustering) topology enumeration bugs/artifacts (sched-domain build errors mostly). SNC enumeration data got more complicated with Granite Rapids X (GNR) and Clearwater Forest X (CWF), which exposed these bugs and made their effects more serious - Also use the now sane(r) SNC code to fix resctrl SNC detection bugs - Work around a historic libgcc unwinder bug in the vdso32 sigreturn code (again), which regressed during an overly aggressive recent cleanup of DWARF annotations * tag 'x86-urgent-2026-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/entry/vdso32: Work around libgcc unwinder bug x86/resctrl: Fix SNC detection x86/topo: Fix SNC topology mess x86/topo: Replace x86_has_numa_in_package x86/topo: Add topology_num_nodes_per_package() x86/numa: Store extra copy of numa_nodes_parsed x86/boot: Handle relative CONFIG_EFI_SBAT_FILE file paths x86/sev: Allow IBPB-on-Entry feature for SNP guests x86/boot/sev: Move SEV decompressor variables into the .data section
2026-03-07Merge tag 'for-linus-7.0-rc3-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: - a cleanup of arch/x86/kernel/head_64.S removing the pre-built page tables for Xen guests - a small comment update - another cleanup for Xen PVH guests mode - fix an issue with Xen PV-devices backed by driver domains * tag 'for-linus-7.0-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen/xenbus: better handle backend crash xenbus: add xenbus_device parameter to xenbus_read_driver_state() x86/PVH: Use boot params to pass RSDP address in start_info page x86/xen: update outdated comment xen/acpi-processor: fix _CST detection using undersized evaluation buffer x86/xen: Build identity mapping page tables dynamically for XENPV
2026-03-06Merge tag 'kbuild-fixes-7.0-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux Pull Kbuild fixes from Nathan Chancellor: - Split out .modinfo section from ELF_DETAILS macro, as that macro may be used in other areas that expect to discard .modinfo, breaking certain image layouts - Adjust genksyms parser to handle optional attributes in certain declarations, necessary after commit 07919126ecfc ("netfilter: annotate NAT helper hook pointers with __rcu") - Include resolve_btfids in external module build created by scripts/package/install-extmod-build when it may be run on external modules - Avoid removing objtool binary with 'make clean', as it is required for external module builds * tag 'kbuild-fixes-7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux: kbuild: Leave objtool binary around with 'make clean' kbuild: install-extmod-build: Package resolve_btfids if necessary genksyms: Fix parsing a declarator with a preceding attribute kbuild: Split .modinfo out from ELF_DETAILS
2026-03-04x86/entry/vdso32: Work around libgcc unwinder bugH. Peter Anvin
The unwinder code in libgcc has a long standing bug which causes it to fail to pick up the signal frame CFI flag. This is a generic bug across all platforms. It affects the __kernel_sigreturn and __kernel_rt_sigreturn vdso entry points on i386. The x86-64 kernel doesn't provide a sigreturn stub, and so there is no kernel-provided code that is affected on x86-64. libgcc does have a legacy fallback path which happens to work as long as the bytes immediately before each of the sigreturn functions fall outside any function. This patch adds a nop before the ALIGN to each of the sigreturn stubs to ensure that this is, indeed, the case. The rest of the patch is just a comment which documents the invariants that need to be maintained for this legacy path to work correctly. This is a manifest bug: in the current vdso, __kernel_vsyscall is a multiple of 16 bytes long and thus __kernel_sigreturn does not have any padding in front of it. Closes: https://lore.kernel.org/lkml/f3412cc3e8f66d1853cc9d572c0f2fab076872b1.camel@xry111.site Fixes: 884961618ee5 ("x86/entry/vdso32: Remove open-coded DWARF in sigreturn.S") Reported-by: Xi Ruoyao <xry111@xry111.site> Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124050 Link: https://patch.msgid.link/20260227010308.310342-1-hpa@zytor.com
2026-03-04x86/resctrl: Fix SNC detectionTony Luck
Now that the x86 topology code has a sensible nodes-per-package measure, that does not depend on the online status of CPUs, use this to divinate the SNC mode. Note that when Cluster on Die (CoD) is configured on older systems this will also show multiple NUMA nodes per package. Intel Resource Director Technology is incomaptible with CoD. Print a warning and do not use the fixup MSR_RMID_SNC_CONFIG. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Tested-by: Zhang Rui <rui.zhang@intel.com> Tested-by: Chen Yu <yu.c.chen@intel.com> Link: https://patch.msgid.link/aaCxbbgjL6OZ6VMd@agluck-desk3 Link: https://patch.msgid.link/20260303110100.367976706@infradead.org
2026-03-04x86/topo: Fix SNC topology messPeter Zijlstra
Per 4d6dd05d07d0 ("sched/topology: Fix sched domain build error for GNR, CWF in SNC-3 mode"), the original crazy SNC-3 SLIT table was: node distances: node 0 1 2 3 4 5 0: 10 15 17 21 28 26 1: 15 10 15 23 26 23 2: 17 15 10 26 23 21 3: 21 28 26 10 15 17 4: 23 26 23 15 10 15 5: 26 23 21 17 15 10 And per: https://lore.kernel.org/lkml/20250825075642.GQ3245006@noisy.programming.kicks-ass.net/ The suggestion was to average the off-trace clusters to restore sanity. However, 4d6dd05d07d0 implements this under various assumptions: - anything GNR/CWF with numa_in_package; - there will never be more than 2 packages; - the off-trace cluster will have distance >20 And then HPE shows up with a machine that matches the Vendor-Family-Model checks but looks like this: Here's an 8 socket (2 chassis) HPE system with SNC enabled: node 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0: 10 12 16 16 16 16 18 18 40 40 40 40 40 40 40 40 1: 12 10 16 16 16 16 18 18 40 40 40 40 40 40 40 40 2: 16 16 10 12 18 18 16 16 40 40 40 40 40 40 40 40 3: 16 16 12 10 18 18 16 16 40 40 40 40 40 40 40 40 4: 16 16 18 18 10 12 16 16 40 40 40 40 40 40 40 40 5: 16 16 18 18 12 10 16 16 40 40 40 40 40 40 40 40 6: 18 18 16 16 16 16 10 12 40 40 40 40 40 40 40 40 7: 18 18 16 16 16 16 12 10 40 40 40 40 40 40 40 40 8: 40 40 40 40 40 40 40 40 10 12 16 16 16 16 18 18 9: 40 40 40 40 40 40 40 40 12 10 16 16 16 16 18 18 10: 40 40 40 40 40 40 40 40 16 16 10 12 18 18 16 16 11: 40 40 40 40 40 40 40 40 16 16 12 10 18 18 16 16 12: 40 40 40 40 40 40 40 40 16 16 18 18 10 12 16 16 13: 40 40 40 40 40 40 40 40 16 16 18 18 12 10 16 16 14: 40 40 40 40 40 40 40 40 18 18 16 16 16 16 10 12 15: 40 40 40 40 40 40 40 40 18 18 16 16 16 16 12 10 10 = Same chassis and socket 12 = Same chassis and socket (SNC) 16 = Same chassis and adjacent socket 18 = Same chassis and non-adjacent socket 40 = Different chassis Turns out, the 'max 2 packages' thing is only relevant to the SNC-3 parts, the smaller parts do 8 sockets (like usual). The above SLIT table is sane, but violates the previous assumptions and trips a WARN. Now that the topology code has a sensible measure of nodes-per-package, we can use that to divinate the SNC mode at hand, and only fix up SNC-3 topologies. There is a 'healthy' amount of paranoia code validating the assumptions on the SLIT table, a simple pr_err(FW_BUG) print on failure and a fallback to using the regular table. Lets see how long this lasts :-) Fixes: 4d6dd05d07d0 ("sched/topology: Fix sched domain build error for GNR, CWF in SNC-3 mode") Reported-by: Kyle Meyer <kyle.meyer@hpe.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Zhang Rui <rui.zhang@intel.com> Tested-by: Chen Yu <yu.c.chen@intel.com> Tested-by: Kyle Meyer <kyle.meyer@hpe.com> Link: https://patch.msgid.link/20260303110100.238361290@infradead.org
2026-03-04x86/topo: Replace x86_has_numa_in_packagePeter Zijlstra
.. with the brand spanking new topology_num_nodes_per_package(). Having the topology setup determine this value during MADT/SRAT parsing before SMP bringup avoids having to detect this situation when building the SMP topology masks. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Tested-by: Tony Luck <tony.luck@intel.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Zhang Rui <rui.zhang@intel.com> Tested-by: Chen Yu <yu.c.chen@intel.com> Tested-by: Kyle Meyer <kyle.meyer@hpe.com> Link: https://patch.msgid.link/20260303110100.123701837@infradead.org
2026-03-04x86/topo: Add topology_num_nodes_per_package()Peter Zijlstra
Use the MADT and SRAT table data to compute __num_nodes_per_package. Specifically, SRAT has already been parsed in x86_numa_init(), which is called before acpi_boot_init() which parses MADT. So both are available in topology_init_possible_cpus(). This number is useful to divinate the various Intel CoD/SNC and AMD NPS modes, since the platforms are failing to provide this otherwise. Doing it this way is independent of the number of online CPUs and other such shenanigans. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Tested-by: Tony Luck <tony.luck@intel.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Zhang Rui <rui.zhang@intel.com> Tested-by: Chen Yu <yu.c.chen@intel.com> Tested-by: Kyle Meyer <kyle.meyer@hpe.com> Link: https://patch.msgid.link/20260303110100.004091624@infradead.org
2026-03-04x86/numa: Store extra copy of numa_nodes_parsedPeter Zijlstra
The topology setup code needs to know the total number of physical nodes enumerated in SRAT; however NUMA_EMU can cause the existing numa_nodes_parsed bitmap to be fictitious. Therefore, keep a copy of the bitmap specifically to retain the physical node count. Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Zhang Rui <rui.zhang@intel.com> Tested-by: Chen Yu <yu.c.chen@intel.com> Tested-by: Kyle Meyer <kyle.meyer@hpe.com> Link: https://patch.msgid.link/20260303110059.889884023@infradead.org
2026-03-04x86/boot: Handle relative CONFIG_EFI_SBAT_FILE file pathsJan Stancek
CONFIG_EFI_SBAT_FILE can be a relative path. When compiling using a different output directory (O=) the build currently fails because it can't find the filename set in CONFIG_EFI_SBAT_FILE: arch/x86/boot/compressed/sbat.S: Assembler messages: arch/x86/boot/compressed/sbat.S:6: Error: file not found: kernel.sbat Add $(srctree) as include dir for sbat.o. [ bp: Massage commit message. ] Fixes: 61b57d35396a ("x86/efi: Implement support for embedding SBAT data for x86") Signed-off-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: <stable@kernel.org> Link: https://patch.msgid.link/f4eda155b0cef91d4d316b4e92f5771cb0aa7187.1772047658.git.jstancek@redhat.com
2026-03-03x86/PVH: Use boot params to pass RSDP address in start_info pageHou Wenlong
After commit e6e094e053af75 ("x86/acpi, x86/boot: Take RSDP address from boot params if available"), the RSDP address can be passed in boot params. Therefore, store the RSDP address in start_info page into boot params in the PVH entry instead of registering a different callback. This removes an absolute reference during the PVH entry and is more standardized. Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <76675c4d49d3a8f72252076812ef8f22276230c2.1772282441.git.houwenlong.hwl@antgroup.com>
2026-03-03x86/xen: update outdated commentkexinsun
The function xen_flush_tlb_others() was renamed xen_flush_tlb_multi() by commit 4ce94eabac16 ("x86/mm/tlb: Flush remote and local TLBs concurrently"). Update the comment accordingly. Signed-off-by: kexinsun <kexinsun@smail.nju.edu.cn> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20260224022424.1718-1-kexinsun@smail.nju.edu.cn>