Skip to content

Conversation

@limbonaut
Copy link
Contributor

@limbonaut limbonaut commented May 25, 2025

Note: The issue with the original implementation was that it was crashing in coroutine tests -- it's resolved here.

Merging this will help Sentry SDK to display accurate stack traces to users.

Coroutine issue explanation

Why it happens

  1. Function is entered, awaited, and therefore exited (we're not interested in details at this point).
  2. Later, the function is resumed with a function-state object (now we're interested in details).
  3. Inside call(): line variable is local, and enter_function() is passed a reference to this variable.
  4. enter_function() initializes CallLevel with a pointer to the line variable (which is scoped to call()).
  5. The line variable is continuously updated during GDScript function execution using OPCODE_LINE operation.
  6. The GDScript function execution ends, call() is exited, and line is popped off the stack, but the CallLevel instance still lives on due to specifics of the resumed function execution (exit_function() isn't called yet).
  7. The next function-state object is resumed via signal, and steps 2–6 repeat until no function-state objects remain in the resume chain.
  8. Resumed functions finally exit, but the CallLevel frames overstayed the lifetime of their line variables.

Conclusion: Because CallLevel in GDScript call stack outlives call()'s scope for resumed coroutines, CallLevel* and int *line pointers can sometimes point to heaven-knows-what.

Note: This PR ensures that function-state execution remains within the call() scope, and doesn't cause issues with CallLevel being freed before it's removed from the GDScript call stack.

Explaining on example

Consider this code:

func _ready() -> void:
	await f1()
	for b in Engine.capture_script_backtraces(): print(b)

func f1():
	print("starting f1")
	await f2()
	print("finishing f1")

func f2():
	print("starting f2")
	await f3()
	print("finishing f2")

func f3():
	print("starting f3")
	await get_tree().process_frame
	print("finishing f3")

Output:

Note: Lines starting with "debug: " are print_line() statements I placed in the native code.

debug: call( @implicit_new ): entering
debug: enter_function( @implicit_new )    [ call_level.line: 0 ]
debug: exit_function( @implicit_new )    [ call_level.line: 0 ]
debug: call( @implicit_new ): exiting
debug: call( _ready ): entering
debug: enter_function( _ready )    [ call_level.line: 4 ]
debug: call( f1 ): entering
debug: enter_function( f1 )    [ call_level.line: 9 ]
starting f1
debug: call( f2 ): entering
debug: enter_function( f2 )    [ call_level.line: 15 ]
starting f2
debug: call( f3 ): entering
debug: enter_function( f3 )    [ call_level.line: 21 ]
starting f3
debug: exit_function( f3 )    [ call_level.line: 23 ]
debug: call( f3 ): exiting
debug: exit_function( f2 )    [ call_level.line: 17 ]
debug: call( f2 ): exiting
debug: exit_function( f1 )    [ call_level.line: 11 ]
debug: call( f1 ): exiting
debug: exit_function( _ready )    [ call_level.line: 5 ]
debug: call( _ready ): exiting
debug: call( f3 ): entering
debug: enter_function( f3 )    [ call_level.line: 23 ]
finishing f3
debug: call( f3 ): exiting
debug: call( f2 ): entering
debug: enter_function( f2 )    [ call_level.line: 17 ]
finishing f2
debug: call( f2 ): exiting
debug: call( f1 ): entering
debug: enter_function( f1 )    [ call_level.line: 11 ]
finishing f1
debug: call( f1 ): exiting
debug: call( _ready ): entering
debug: enter_function( _ready )    [ call_level.line: 5 ]
GDScript backtrace (most recent call first):
    [0] _ready (res://main.gd:6)
    [1] f1 (res://main.gd:32765)
    [2] f2 (res://main.gd:32765)
    [3] f3 (res://main.gd:32765)
debug: call( _ready ): exiting
debug: exit_function( _ready )    [ call_level.line: 6 ]
debug: exit_function( f1 )    [ call_level.line: 32765 ]
debug: exit_function( f2 )    [ call_level.line: 32765 ]
debug: exit_function( f3 )    [ call_level.line: 32765 ]

Notice that the backtrace reports crazy line numbers, and how call(f1) already exited, but the associated CallLevel frame lingers slightly longer in the end (exit_function(f1)). This only happens for resumed functions.

@mihe
Copy link
Contributor

mihe commented May 26, 2025

Cool stuff! I spent days trying to find some fix for that issue with the linked list approach in #91006, with various invocations of GDScriptLanguage::exit_function(), but just found myself with more cryptic crashes.

I can't say I understand all the ins-and-outs of this alternative approach with the coroutine signal handover, so I'll leave that for the GDScript people to review, but seeing as how all the unit tests pass and whatnot, this looks promising.

You'll probably want to change the documentation of the debug/settings/gdscript/always_track_call_stacks project setting while you're at it:

<member name="debug/settings/gdscript/always_track_call_stacks" type="bool" setter="" getter="" default="false">
Whether GDScript call stacks will be tracked in release builds, thus allowing [method Engine.capture_script_backtraces] to function.
Enabling this comes at the cost of roughly 40 KiB for every thread that runs any GDScript code.
[b]Note:[/b] This setting has no effect on editor builds or debug builds, where GDScript call stacks are tracked regardless.
</member>

EDIT: I just realized you removed the track_call_stack check from GDScriptLanguage::enter_function and GDScriptLanguage::exit_function, so I guess this project setting is in a bit of a weird state at the moment, where the call stacks are in fact always tracked regardless of build, but OPCODE_LINE won't always be used, so you won't always have accurate line numbers?

@limbonaut
Copy link
Contributor Author

limbonaut commented May 26, 2025

I just realized you removed the track_call_stack check from GDScriptLanguage::enter_function and GDScriptLanguage::exit_function

Oh, I probably removed that check by accident, as I was thinking of removing the option altogether, but it might be too controversial a change. I'm not sure how open the team is to enabling script traces for all builds by default.

UPDATE:
Fixed it -- the checks are back.

@limbonaut limbonaut force-pushed the linked-call-stack branch from db41873 to 53caa04 Compare May 26, 2025 10:09
@limbonaut
Copy link
Contributor Author

Just realized, since CallLevel now resides on the call() stack and is 48 bytes, its memory cost is incurred regardless of whether the track_call_stack option is enabled. Although the total memory used depends on the actual GDScript call stack size rather than the maximum. The option still affects whether the OPCODE_LINE instructions are added.

@limbonaut limbonaut force-pushed the linked-call-stack branch from 53caa04 to 9327cbf Compare May 26, 2025 11:01
@limbonaut limbonaut requested a review from a team as a code owner May 26, 2025 11:01
@limbonaut
Copy link
Contributor Author

limbonaut commented May 26, 2025

I found an issue with this implementation. Since I removed GDFunctionState::first_state, there's nothing keeping first_state alive because it's a RefCounted, so the "completed" signal sometimes gets called on a freed instance of GDFunctionState. I'll make adjustments.

UPDATE: Done.

@limbonaut limbonaut force-pushed the linked-call-stack branch from 9327cbf to db9b9b6 Compare May 26, 2025 14:01
@akien-mga akien-mga requested a review from vnen June 16, 2025 00:07
Copy link
Member

@vnen vnen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really familiar with the call stack code and the coroutine code is quite complex so it's always hard to review. But the code does look okay to me.

We probably need to add proper test for coroutines to guarantee they work as expected, but this is a topic for another day.


if (unlikely(_call_stack.stack_pos >= _debug_max_call_stack)) {
if (unlikely(_call_stack_size >= _debug_max_call_stack)) {
// stack overflow
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to avoid comments like this. The error message is already saying "Stack overflow", so this is very redundant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense – I removed the comment.

@Mickeon Mickeon removed the request for review from a team June 29, 2025 23:59
@limbonaut limbonaut force-pushed the linked-call-stack branch from db9b9b6 to 0d2a6c3 Compare June 30, 2025 07:43
* GDScript call stack as reverse linked list with issues fixed
(originally proposed in 91006).
* Fix coroutine issues with call stack by resuming async call chain
inside `GDScriptFunction::call()`.
* This fixes corrupted line numbers for coroutines in the debugger and
backtrace (106489).

Co-authored-by: Juan Linietsky <[email protected]>
@limbonaut limbonaut force-pushed the linked-call-stack branch from 0d2a6c3 to a095c5e Compare June 30, 2025 09:38
@Repiteo Repiteo merged commit 75751c0 into godotengine:master Jul 1, 2025
20 checks passed
@Repiteo
Copy link
Contributor

Repiteo commented Jul 1, 2025

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Incorrect line numbers for coroutines in GDScript stack traces

6 participants