-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Fix stack overflow issue due to nested interrupts #11276
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
CpuDxe driver uses a global C variable to record the interrupt state. The state variable is updated every time CpuArch.EnableInterrupt() or CpuArch.DisableInterrupt() is called. CpuArch.GetInterruptState() simply returns the state variable. But when CpuArch.GetInterruptState() is called in the interrupt context, even the interrupt state is enabled before interrupt happens, because the interrupt is not disabled through CpuArch.DisableInterrupts(), CpuArch.GetInterruptState() still returns that the interrupt state is enabled. It's not correct. The commit removes the C global variable and always reads the interrupt state from CPU register. Signed-off-by: Ray Ni <[email protected]>
This is a heavily simplified version of the fix in the OvmfPkg NestedInterruptTplLib. Putting it in DXE core allows CoreRestoreTpl() to lower the TPL while keeping interrupts disabled, removing the need for either DisableInterruptsOnIret() or the complex deferred execution mechanism. Instead, CoreRaiseTpl() uses the current state of the interrupt flag to second guess whether it's being called from an interrupt handler; when restoring the outer TPL at the end of the handler, interrupts remain disabled until IRET. This eliminates the possibility that a nested invocation of the interrupt handler has the same TPL as the outer one. Signed-off-by: Michael D Kinney <[email protected]> Signed-off-by: Ray Ni <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
…ectly Because gBS.Raise/RestoreTPL have been fixed to handle the nested timer interrupts. Signed-off-by: Ray Ni <[email protected]> Signed-off-by: Wei6 Xu <[email protected]>
Because gBS.Raise/RestoreTPL have been fixed to handle the nested timer interrupts, NestedInterruptTplLib is now redundant, remove it entirely from OvmfPkg. Signed-off-by: Wei6 Xu <[email protected]>
|
Filling the description with some actual content would be very nice, especially on testing. The patches look good to me. @mcb30 FYI |
|
Looks plausible to me. Has it been tested in the two use cases that originally caused problems?
|
I will try the above two cases and update the result here later. |
|
The generic changes in this PR affect all architectures, not just X86. |
| // | ||
| // Bit mask of TPLs that were interrupted (typically during RestoreTPL's | ||
| // event dispatching, though there are reports that the Windows boot loader | ||
| // executes stray STIs at TPL_HIGH_LEVEL). CoreRaiseTpl() sets the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we please avoid X86-specific lingo here? This code is shared by all architectures, and terms like STI and IRET may not be familiar to everyone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree. we could just say "Windows boot loader disables CPU interrupt at TPL_HIGH_LEVEL"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree. we could just say "Windows boot loader disables CPU interrupt at TPL_HIGH_LEVEL"
Enables, not disables. Interrupts are supposed to be disabled at TPL_HIGH_LEVEL, and EDK2 assumes that this invariant will always hold. The Windows bug is that it first raises to TPL_HIGH_LEVEL (which is already illegal: that level is specified to be "for use exclusively by the firmware") and then enables interrupts (which breaks the firmware's invariant assumption).
| // Within an interrupt handler. Save the TPL that was interrupted; | ||
| // It must be higher than the previously interrupted TPL, since | ||
| // CoreRestoreTpl reset all bits up to and including the requested TPL. | ||
| // |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are conflating interrupts are disabled with running within an interrupt handler here. This needs justification, because not all CPU arch protocol implementations may behave the same in this regard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code assumes that the CPU interrupt is disabled as soon as the interrupt handler runs.
So, when interrupts are disabled, the code assumes running in the interrupt context.
Can you explain what else the implementation could be? Or suggest a better change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I reread the commit message, I found it could be a bit easier to understand if reworded as follows:
X86 CpuDxe driver uses a global C variable to record the interrupt state.
The global C variable is updated every time CpuArch.EnableInterrupt() or
CpuArch.DisableInterrupt() is called.
CpuArch.GetInterruptState() returns the global C variable value.
But CpuArch.GetInterruptState() incorrectly returns TRUE when it is called
in the interrupt context.
The reason is that the C global variable indicating the interrupt state
remains TRUE after the interrupt is disabled by CLI from the CpuExceptionHandlerLib.
The commit removes the C global variable and always reads the
interrupt state from the CPU register (EFLAGS).
| // | ||
| // Bit mask of TPLs that were interrupted (typically during RestoreTPL's | ||
| // event dispatching, though there are reports that the Windows boot loader | ||
| // executes stray STIs at TPL_HIGH_LEVEL). CoreRaiseTpl() sets the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree. we could just say "Windows boot loader disables CPU interrupt at TPL_HIGH_LEVEL"
| // Within an interrupt handler. Save the TPL that was interrupted; | ||
| // It must be higher than the previously interrupted TPL, since | ||
| // CoreRestoreTpl reset all bits up to and including the requested TPL. | ||
| // |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code assumes that the CPU interrupt is disabled as soon as the interrupt handler runs.
So, when interrupts are disabled, the code assumes running in the interrupt context.
Can you explain what else the implementation could be? Or suggest a better change?
|
@eblipp please stop pasting LLM generated drivel into random PRs |
|
Looks okay for Arm, besides the already raised issue around x86isms in comments. |
|
@xuweiintel Please fix CI issues. @niruiyu are any more changes required to merge? |
Thank you. How do we tell in this test that an interrupt actually occurred between |
|
The testing seems quite sketchy to me, compared to the testing that was done for NestedInterruptTplLib. I would like to have some confidence that:
Otherwise, the testing documented so far is just at the level of "it seems to work ok", which is really not sufficient for something with this level of subtlety and complexity. |
|
@mcb30 Do you have a link to the testing that was done for the NestedInterruptTplLib so we can rerun the same tests? I think we should also consider some unit tests to make sure the scenarios in Case 2 are always covered so we never introduce a regression. For Case 1, USB Network Stack, is there a link to the root cause analysis for that scenario, so we can potentially implement unit tests for that general case as well? |
It's unfortunately all handwritten on paper. I'm happy to share copies of the notebooks, if it would help.
There's a very long conversation about it in the original bugzilla, which is now at #9256 |







Description
<Include a description of the change and why this change was made.>
<For each item, place an "x" in between
[and]if true. Example:[x](you can also check items in GitHub UI)><Create the PR as a Draft PR if it is only created to run CI checks.>
<Delete lines in <> tags before creating the PR.>
How This Was Tested
<Describe the test(s) that were run to verify the changes.>
Integration Instructions
<Describe how these changes should be integrated. Use N/A if nothing is required.>