andyross
Contributor

@andyross andyross commented Feb 5, 2025

  1. Mostly complete. Supports MPU, userspace, PSPLIM-based stack guards, and FPU/DSP features. ARMv8-M secure mode "should" work but I don't know how to test it.

  2. Designed with an eye to uncompromising/best-in-industry cooperative context switch performance. No PendSV exception nor hardware stacking/unstacking, just a traditional "musical chairs" switch. Context gets saved on process stacks only instead of split between there and the thread struct. No branches in the core integer switch code (and just one in the FPU bits that can't be avoided).

  3. Minimal assembly use; arch_switch() itself is ALWAYS_INLINE, there is an assembly stub for exception exit, and that's it beyond one/two instruction inlines elsewhere.

  4. Selectable at build time, interoperable with existing code. Just use the pre-existing CONFIG_USE_SWITCH=y flag to enable it. Or turn it off to evade regressions as this stabilizes.

  5. Exception/interrupt returns in the common case need only a single C function to be called at the tail, and then return naturally. Effectively "all interrupts are direct now". This isn't a benefit currently because the existing stubs haven't been removed (see # 4), but in the long term we can look at exploiting this. The boilerplate previously required is now (mostly) empty.

  6. No support for ARMv6 (Cortex-M0 et al.) Thumb code. The expanded instruction encodings in ARMv7 are a big (big) win, so the older cores really need a separate port to avoid impacting newer hardware. Thankfully there isn't that much code to port (see # 3), so this should be doable.
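
To make point 2 in the list above concrete, here is a toy host-side model in C of a switch that keeps the whole suspended context on the outgoing thread's own stack and uses the saved stack pointer as the switch handle. It is only a sketch under stated assumptions: `toy_cpu`, `toy_switch()`, `toy_new_thread()` and the eight-register frame are hypothetical names and layout chosen for illustration, not the PR's code, and the real arch_switch() works on actual CPU registers rather than a struct.

```c
#include <stdint.h>

#define NUM_CALLEE_SAVED 8 /* r4-r11 on Cortex-M */

/* Simulated CPU state: callee-saved registers plus a stack pointer. */
struct toy_cpu {
	uint32_t regs[NUM_CALLEE_SAVED];
	uint32_t *sp;
};

/*
 * Same shape as arch_switch(switch_to, switched_from): the "handle"
 * of a suspended thread is simply its saved stack pointer.
 */
static void toy_switch(struct toy_cpu *cpu, void *switch_to,
		       void **switched_from)
{
	/* Push the outgoing thread's callee-saved registers on its stack. */
	for (int i = 0; i < NUM_CALLEE_SAVED; i++) {
		*--cpu->sp = cpu->regs[i];
	}
	/* Publish the handle; nothing is copied into a thread struct. */
	*switched_from = cpu->sp;

	/* Adopt the incoming stack and pop its registers in reverse order. */
	cpu->sp = switch_to;
	for (int i = NUM_CALLEE_SAVED - 1; i >= 0; i--) {
		cpu->regs[i] = *cpu->sp++;
	}
}

/* Build an initial frame for a thread that has never run (hypothetical). */
static void *toy_new_thread(uint32_t *stack_top, uint32_t fill)
{
	for (int i = 0; i < NUM_CALLEE_SAVED; i++) {
		*--stack_top = fill;
	}
	return stack_top;
}
```

Switching from thread A to thread B and back leaves A's registers and stack pointer exactly as they were, with no per-thread register slots outside the stacks themselves.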

@andyross
Contributor Author

andyross commented Feb 5, 2025

This is finally looking good enough to submit, let's see how it runs in CI. First, it's important to note that @ithinuel has an entirely different arch_switch() implementation in #85080 that everyone should review too. That one is a relatively straight-line evolution of the current PendSV implementation. This one is (as I'm sure surprises no one) more of a rewrite, using a "normal" context switch. Really I don't see any reason why both shouldn't be able to merge: this will likely take some time to stabilize and we'd want to be maintaining the old stuff in parallel anyway.

The big advantages to this one over that one:

  1. Smaller. I worked really hard to limit code size for performance reasons. And there's more fruit to pick: the thread struct can lose all the still-present slots for the callee-saved registers that now live on the stack, and lots of the legacy fault handlers have boilerplate that now duplicates the exit code that runs out of a regular C handler.

  2. Bigger, heh. Well, more complete. This works with the PSPLIM stack guard feature (which btw: we have very poor test coverage of!) and FPU hardware (which we have almost no coverage of: there's only one in-tree qemu FPU platform and it doesn't run in CI). And as I understand the architecture, secure mode should ("should") work too, but I don't have a system to test with.

  3. It's actually kinda scary fast, which is what I was hoping to see. The microbenchmark at the end of the series is showing about 60% improvement in z_swap() on my FRDM-K64F vs. the current tree (just z_swap though, not all the other stuff!). It's tuned heavily for the common case of cooperative switching, using a custom entirely-on-process-stack frame format for suspended threads and not the one the hardware emits (there's a conversion step when threads switch on interrupt/exception exit).

  4. Legacy-free. No more ARCH_HAS_CUSTOM_SWAP_TO_MAIN or ARCH_HAS_THREAD_ABORT (and especially no more SWAP_NONATOMIC!), nor a custom arch_thread_return_value_set(). ARM Cortex M as of this patch looks like a "standard" Zephyr platform without any magic.

  5. Minimal impact on existing code. The new context layer is in two new files with only ~130 lines of changes to existing code.

Contributor

@wearyzen wearyzen left a comment

I am still going through the changes, but overall I think using CMSIS APIs would improve readability and address concerns like having an ISB in the required places.
There seem to be some older comments from ithinuel and others that have not been addressed yet; could you please have a look and reply to those as well?

For the issue you see with the Corstone310/315/320 FVP boards, I don't think this should block the current PR, so let's go with this change for now and raise a GitHub issue assigned to me. I will start looking at it as soon as I am done with the review of this PR.

@github-project-automation github-project-automation bot moved this from 4.3 to 3.7 (LTS3) in Release Plan Aug 14, 2025
@jhedberg jhedberg moved this from 3.7 (LTS3) to 4.3 in Release Plan Aug 15, 2025
@andyross
Contributor Author

Some notes on the ISB pedantry. :) Basically "exactly who demands this and in what specification?"

And the CMSIS stuff is of a piece. I have a comment above from the spring about the same subject, but the core reason that I don't like HAL code is that it's filled with needless conservatism like this and semi-voodoo proscriptions that don't align with the documented behavior of the hardware. And especially that in general the HAL documentation is much worse than the ISA documentation is. I can read a formal behavior of BASEPRI and MSR and CONTROL in the architecture spec that tells me exactly what the instruction is going to do, right down to the level of interpretable pseudo code, and tell me how many pipeline stages the following instruction will be delayed. At best things like __set_CONTROL give you a general sense of what's going to happen, but often leave out critically important details (like in this case, "Oh, yeah, it does a pipeline stall too").

Basically I've been burned repeatedly by HAL layers. CMSIS is surely better than Xtensa, but I still don't trust it.

@andyross
Contributor Author

But that said, I'm fine if someone wants to come by later with a cleanup pass pointing out that this or that assembly sequence can be done with fewer instructions or whatever with CMSIS. I'm not a zealot. I'm just saying that given the choice between following instructions left by a software developer or a hardware engineer, I'm going to pick the hardware team every single time.

@ljd42
Contributor

ljd42 commented Aug 16, 2025

> Some notes on the ISB pedantry. :) Basically "exactly who demands this and in what specification?"

I can follow your line of thought, Andy! I was taught in my ARM training "change CONTROL, use ISB". But the more I look at it, the more I see it as "a cooking recipe" to be on the safe side in case one does need an ISB. And if one doesn't, one just 'wastes some cycles', which is often seen as the lesser evil for most applications. Of course, here it matters!

I checked Joseph Yiu's definitive guide for the Cortex M23/33, which has a dedicated chapter on "OS support features". For the context switch code, he mentions "executes an ISB after CONTROL updates (architecture recommendation)", but without any further explanation. It would be interesting to see what he wrote in the previous book dealing with the M3/M4. While the Armv7-M Architecture Reference is quite detailed about ISB uses, the Armv8-M one remains rather elusive on the topic. Or at least that's how I feel; perhaps I'm not looking in the right place.

There are interesting aspects of this PR to be presented and discussed. I'm really looking forward to your talk at the ZDS@Amsterdam!

@JordanYates
Contributor

JordanYates commented Aug 17, 2025

Tangential issue relating to consequences, not the implementation.
I'm guessing that since the storage format of frames on the stack no longer matches the architecture default, debuggers won't be able to generate/parse backtraces without updates?
e.g.: https://github.com/zephyrproject-rtos/jlink-zephyr
https://github.com/zephyrproject-rtos/jlink-zephyr/blob/f06eb4ed7e91ada9b5a0532ca0c9647ee3da31bd/zephyr_plugin.c#L112

@andyross
Contributor Author

andyross commented Aug 18, 2025

> debuggers won't be able to generate/parse backtraces without updates?

They will during an exception; the pickling doesn't happen until context switch. If you want to inspect an OS-suspended thread, then yes, an adaptation layer will be needed to do the conversion. The routines are there, something just needs to call them in the right spots.

Edit: actually, does that even work right now? Every hardware debugger interface I've used only deals with CPU state; where there are multiple "threads" in gdb in these setups, they reflect SMP contexts and not OS threads. Does J-Link have the intelligence to make this work?
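
As a rough illustration of what such an adaptation layer might look like on the debugger side, the sketch below derives the registers an unwinder needs from a suspended-thread frame. The hardware exception frame layout (r0-r3, r12, lr, pc, xPSR) is the architectural basic frame; everything else is an assumption: `sw_switch_frame`, `unwind_seed` and `seed_from_switch_frame()` are hypothetical, and the PR's real on-stack format and conversion routines may look quite different.

```c
#include <stdint.h>

/* Basic (non-FPU) frame the hardware pushes on exception entry. */
struct hw_exc_frame {
	uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
};

/* HYPOTHETICAL suspended-thread layout; the PR's format may differ. */
struct sw_switch_frame {
	uint32_t r4_r11[8];
	uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
};

/* Registers a debugger needs to start unwinding a suspended thread. */
struct unwind_seed {
	uint32_t pc, lr, sp, r7;
};

/* frame_addr is the target address at which the frame was read. */
static struct unwind_seed
seed_from_switch_frame(const struct sw_switch_frame *f, uint32_t frame_addr)
{
	struct unwind_seed s = {
		.pc = f->pc,
		.lr = f->lr,
		/* The thread's stack resumes just above the saved frame. */
		.sp = frame_addr + (uint32_t)sizeof(*f),
		/* r7 is the conventional Thumb frame pointer. */
		.r7 = f->r4_r11[3],
	};
	return s;
}
```

A thread-aware plugin would read the frame from the saved stack pointer of each suspended thread and hand the resulting pc/lr/sp to its normal unwinder.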

@mmahadevan108
Contributor

@hakehuang is it possible to test this PR on NXP boards with a full test cycle run?

@ljd42
Contributor

ljd42 commented Aug 29, 2025

> @hakehuang is it possible to test this PR on NXP boards with a full test cycle run?

I tested on the MCXN947. It would be awesome to have more NXP boards!

@JarmouniA
Contributor

JarmouniA commented Aug 29, 2025

For anyone who wants to test this PR on a test bench:

west twister --device-testing --hardware-map map.yml -a arm -v -t kernel -t arm -t userspace -t interrupt -t linker -t fpu -t mpu -t trusted-firmware-m -t timing -t benchmark -t cmsis_dsp -t device -t cache -t memory_protection -t threads -t timer

(I didn't find a trivial way to filter for just Cortex-M platforms.)

@hakehuang
Contributor

hakehuang commented Aug 31, 2025

> @hakehuang is it possible to test this PR on NXP boards with a full test cycle run?

According to a regression run of everything under -T tests/kernel/, all NXP platforms pass except for the failure below, seen on mimxrt1170/mimxrt1160_evk_cm7:


===================================================================
START - test_slice_perthread

    Assertion failed at WEST_TOPDIR/zephyr/tests/kernel/sched/schedule_api/src/test_slice_scheduling.c:159: slice_expired: ((dt - PERTHREAD_SLICE_TICKS) <= TICK_SLOP is false)
slice expired >4 ticks late (dt=63)
 FAIL - test_slice_perthread in 0.033 seconds
===================================================================
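
For readers unfamiliar with this test, the failing assertion boils down to the check sketched below: the measured time to slice expiry may exceed the configured slice by at most a small slop. `slice_expired_on_time()` is a hypothetical helper written for this note; the log suggests a slop of 4 ticks, but the slice length is not shown, so any value used with it is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True when the slice expired no more than `slop` ticks later than
 * the configured `slice_ticks`; mirrors the asserted expression
 * (dt - PERTHREAD_SLICE_TICKS) <= TICK_SLOP.
 */
static bool slice_expired_on_time(int32_t dt, int32_t slice_ticks,
				  int32_t slop)
{
	return (dt - slice_ticks) <= slop;
}
```

With dt=63 and a slop of 4, the check fails for any slice length below 59 ticks, which is what the log reports.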

@cvinayak
Contributor

cvinayak commented Sep 2, 2025

Just me thinking aloud...

The BBC Micro Bit board uses the nRF51822 SoC with an ARM Cortex-M0 at 16 MHz, 256 KB of flash and 16 KB of RAM. This board (though deprecated by the vendor) is always a favorite/challenge for getting simple Bluetooth features working (say, a peripheral-role heart rate service with encrypted connections).

Current upstream main shows Radio ISR latencies of ~90 us (observed in #95191), of which ARM Cortex-M0 wakeup with hardware program frame stacking should ideally be 10 us (SoC CPU clock and flash access settling time included). And, say, can we get the Zephyr ISR vector overhead down to another 10 us (i.e. on a 16 MHz Cortex-M0)?

Bluetooth implementations have a hard deadline for processing Radio ISRs on on-air packet reception (last bit received on-air to first bit to transmit on-air) as low as 230 us (1M PHY). Having a 90 us Radio ISR latency is a bottleneck today (maybe this is a Controller-design-induced latency, with other software interrupts at the same priority adding to the Radio ISR latency).

Gradual changes/refactoring of Zephyr's primitive implementations have steadily increased latencies (they may have gone up and then, after some optimizations, come down).

I hope this BBC Micro Bit board can live longer in the Zephyr Project (truly being an OS for the resource-constrained)!
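
To put those microsecond figures into CPU terms, a small conversion sketch may help. `us_to_cycles()` is a hypothetical helper; the 16 MHz clock and the 10/90/230 us figures are the ones quoted above.

```c
#include <stdint.h>

/* Convert a duration in microseconds to CPU cycles at a given clock. */
static uint32_t us_to_cycles(uint32_t us, uint32_t cpu_hz)
{
	return (uint32_t)(((uint64_t)us * cpu_hz) / 1000000u);
}
```

At 16 MHz, 10 us is 160 cycles, the observed 90 us is 1440 cycles, and the 230 us deadline is 3680 cycles, so the ISR latency alone consumes a bit over a third of the budget.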

Related: #74345

@jacob-wienecke-nxp
Contributor

jacob-wienecke-nxp commented Sep 2, 2025

Is there a test/benchmark in Zephyr that best highlights the performance differences using this PR?

I'd like to test
MCXN947 (M33)
RT1170 (M4)
RT1170 (M7)

EDIT: I ran a quick check using the new benchmark added by this PR:
The latency difference going from CONFIG_USE_SWITCH=n -> CONFIG_USE_SWITCH=y

FRDM_MCXN947 (CM33)
IRQ: +7.06% (slightly slower)
IRQ_P: +22.79% (slower)
SWAP: −18.92% (faster)

RT1170 CM7
IRQ: −53.16% (faster)
IRQ_P: −24.7% (faster)
SWAP: −99.29% (much faster, and the reduced latency seems unreasonable compared to other results)

RT1170 CM4
IRQ: +4.45% (slightly slower)
IRQ_P: +22.11% (slower)
SWAP: −20.0% (faster)

I can't find a better test/benchmark to highlight the performance differences right now.
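
For reference, the percentages above are presumably relative changes of the form (new - old) / old. A minimal sketch, where `latency_change_pct()` is a hypothetical helper and the inputs are whatever cycle or time counts the benchmark prints:

```c
/*
 * Relative latency change in percent when going from the old build
 * (CONFIG_USE_SWITCH=n) to the new one (CONFIG_USE_SWITCH=y).
 * Negative means the new build is faster.
 */
static double latency_change_pct(double old_latency, double new_latency)
{
	return (new_latency - old_latency) * 100.0 / old_latency;
}
```

Read this way, a change of −99.29% means the new figure is roughly 1/140 of the old one, which is why that particular SWAP result looks suspicious next to the roughly −20% seen on the other cores.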

@ljd42
Contributor

ljd42 commented Sep 3, 2025

Hi @jacob-wienecke-nxp

> Is there a test/benchmark in Zephyr that best highlights the performance differences using this PR?

tests/arch/arm/arm_switch
Added with this PR, it specifically measures context switch performance. @andyross can tell more about it. The current version needs some fixing because it uses PSPLIM, which is only available on the ARMv8-M architecture. Workaround suggested by @JarmouniA:

> I guarded the PSPLIM stuff in tests/arch/arm/arm_switch/src/main.c with CONFIG_BUILTIN_STACK_GUARD

tests/benchmarks/thread_metric
Used to compare RTOSes for marketing purposes and otherwise. Some results on M4/M7/M33 are available here.

tests/benchmarks/latency_measure
Measures various latencies; I don't know much about it except that it's been there for a while.
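
The PSPLIM workaround quoted above for tests/arch/arm/arm_switch could look roughly like the following. This is a sketch of the guarding pattern only: `read_psplim_or_zero()` is a hypothetical helper, and the actual test code in the PR is not reproduced here.

```c
#include <stdint.h>

/*
 * Touch PSPLIM only when the build enables the hardware stack guard
 * (CONFIG_BUILTIN_STACK_GUARD); otherwise report 0 so that callers
 * can skip PSPLIM-specific checks on cores without the register.
 */
static inline uint32_t read_psplim_or_zero(void)
{
#ifdef CONFIG_BUILTIN_STACK_GUARD
	uint32_t limit;

	__asm__ volatile("mrs %0, psplim" : "=r"(limit));
	return limit;
#else
	return 0u;
#endif
}
```

On ARMv7-M targets the `mrs` instruction is then never assembled, which avoids the build failure on M4/M7 parts.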

@wearyzen wearyzen self-requested a review September 3, 2025 21:58
@jhedberg
Member

What's the status with this? There seemed to be really good momentum toward getting this mergeable, but now it's been pretty quiet for three weeks.

@ljd42
Contributor

ljd42 commented Sep 24, 2025

@jhedberg : Most of my review comments have been resolved. What is left is the PSPLIM bug in tests/arch/arm/arm_switch/src/main.c, two minor issues (an erroneous comment and the use of the new arch_switch in the NS domain), and resolving whether the ISB is needed in one specific case.

@andyross
Contributor Author

Oops, didn't realize it needs a rebase. Will do that tonight. I think this is probably safe to merge, given the old code remains functional and there are no API changes. And I'll revisit the review comments; I'd convinced myself they were all done, please point me at what's needed.

@wearyzen
Contributor

> Some notes on the ISB pedantry. :) Basically "exactly who demands this and in what specification?"

In general this is what I followed:
As per Section B3.4 paragraph RSNGJ of the Armv8-M arch ref manual (https://developer.arm.com/documentation/ddi0553/by/?lang=en)

The architecture requires a Context synchronization event to guarantee that a change to the CONTROL register will affect the execution of instructions appearing later in the program order.

From section B1.4.4 of the Armv7-M arch ref manual (https://developer.arm.com/documentation/ddi0403/ee/?lang=en):
Software must use an ISB barrier instruction to ensure a write to the CONTROL register takes effect before the next instruction is executed.

Also in section B1.4.4 of the Armv6-M arch ref manual (https://developer.arm.com/documentation/ddi0419/e/?lang=en)
Software must use an ISB barrier instruction to ensure a write to the CONTROL register takes effect before the next instruction is executed.

> And the CMSIS stuff is of a piece. I have a comment above from the spring about the same subject, but the core reason that I don't like HAL code is that it's filled with needless conservatism like this and semi-voodoo proscriptions that don't align with the documented behavior of the hardware. And especially that in general the HAL documentation is much worse than the ISA documentation is. I can read a formal behavior of BASEPRI and MSR and CONTROL in the architecture spec that tells me exactly what the instruction is going to do, right down to the level of interpretable pseudo code, and tell me how many pipeline stages the following instruction will be delayed. At best things like __set_CONTROL give you a general sense of what's going to happen, but often leave out critically important details (like in this case, "Oh, yeah, it does a pipeline stall too").
>
> Basically I've been burned repeatedly by HAL layers. CMSIS is surely better than Xtensa, but I still don't trust it.

CMSIS is widely adopted across Zephyr and its HALs, which shows it's both mature and stable. If there were real issues in its APIs, they'd likely show up broadly across the system, not just in switch code.
That's why we recommend using CMSIS APIs instead of assembly code: it makes the code easier to read, more portable, and much simpler to maintain over time.
But yes, changing the assembly instructions to CMSIS APIs wherever applicable can be done as part of a follow-up PR.
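
For concreteness, the pattern the architecture manuals quoted above describe (a CONTROL write followed by an ISB) can be sketched as below. `write_control_synced()` and `read_control()` are hypothetical helpers, not code from the PR; the host fallback with `fake_control` exists only so the snippet compiles off-target. Whether the barrier is needed at each particular site in the switch code is exactly what is under discussion.

```c
#include <stdint.h>

#if defined(__ARM_ARCH_7M__) || defined(__ARM_ARCH_7EM__) || \
	defined(__ARM_ARCH_8M_MAIN__)
#define ON_CORTEX_M_MAINLINE 1
#else
#define ON_CORTEX_M_MAINLINE 0
static uint32_t fake_control; /* host-only stand-in for CONTROL */
#endif

/* Write CONTROL, then synchronize so later instructions see the change. */
static inline void write_control_synced(uint32_t val)
{
#if ON_CORTEX_M_MAINLINE
	/* CMSIS-Core equivalent: __set_CONTROL(val); __ISB(); */
	__asm__ volatile("msr CONTROL, %0\n\t"
			 "isb"
			 : : "r"(val) : "memory");
#else
	fake_control = val;
#endif
}

static inline uint32_t read_control(void)
{
#if ON_CORTEX_M_MAINLINE
	uint32_t val;

	__asm__ volatile("mrs %0, CONTROL" : "=r"(val));
	return val;
#else
	return fake_control;
#endif
}
```

Both spellings emit the same two instructions on target; the disagreement in this thread is about which form is easier to audit, and about sites where the ISB might be skipped.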

@RobinKastberg
Contributor

RobinKastberg commented Sep 25, 2025

I still don't think IAR is working, but I haven't nailed down exactly why.
I am happy to provide a token and toolchain if you want to test.
Will it be possible for us to disable this in Kconfig? I don't want to block this PR in general.

Contributor

@RobinKastberg RobinKastberg left a comment

IAR assembler needs this.

*/
__attribute__((naked)) void arm_m_iciit_stub(void)
{
__asm__("udf 0;");

Suggested change:
- __asm__("udf 0;");
+ __asm__("udf #0;");

@etienne-lms
Contributor

For info, I am seeing CPU faults on a few test cases on STM32 boards embedding a Cortex-M33.

  • Running tests/benchmarks/posix/threads, the CPU halts on "undefined instruction" in arm_m_iciit_stub(). Reproduced on STM32 H5/L5/U3/U5/WBA55/WBA65. Is there something missing in the configuration for these platforms to properly handle this exception?
  • Running tests/arch/arm/arm_switch also crashes on these boards at the first iteration, after the trace main() switching to my_fn() (iter %d)....

@cfriedt
Member

cfriedt commented Oct 8, 2025

@andyross, @wearyzen - please sync up soon (ideally well before feature freeze on Oct 24) to ensure we can get these changes into the next release.

@wearyzen
Contributor

wearyzen commented Oct 9, 2025

Hi @andyross, apart from the rebase and IAR issue, there is also this PR #96850 that I think would need to go in first.

@cfriedt cfriedt moved this from 4.3 to 4.4 in Release Plan Oct 14, 2025
@cfriedt
Member

cfriedt commented Oct 14, 2025

Targeting Zephyr release 4.4 as per Release WG discussion.
