@Kasra-G Kasra-G commented Jun 19, 2025

Issue # (if applicable)

Closes #35390.

Reason for this change

When using the DistributedMap state, the state machine cannot be redriven without additional configuration from the user. These additional steps seem to sometimes lead users into circular dependency issues (see example).

Additionally, the DistributedMap assigns incorrect permissions for states:StopExecution and states:DescribeExecution: these need to be applied to the execution ARN, not the state machine ARN (AWS Docs vs code reference).

This PR also fixes an inconsistency for the recently added grantRedriveExecution function as it does not work for state machines that have DistributedMaps.
The root cause is that when adding a DistributedMap to a state machine, additional permissions need to be added to the state machine's execution role to permit redriving a map run. Otherwise, when the state machine is redriven, the map run will fail to redrive. The docs provide a minimum permissions example https://docs.aws.amazon.com/step-functions/latest/dg/iam-policies-eg-dist-map.html#iam-policy-redrive-dist-map
The key permission is states:RedriveExecution on the following arn:
arn:aws:states:us-east-2:123456789012:execution:myStateMachineName/myMapRunLabel:*

However, when creating a state machine manually in the AWS Console, if there are any unlabeled distributed maps, the policy generated for the new state machine role uses the following resource instead:
arn:aws:states:us-east-2:123456789012:execution:myStateMachineName/*:*

This change mirrors that logic in CDK.
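
As an illustration of the rule described above, here is a minimal sketch of how the RedriveExecution resources could be derived from the map run labels. It is not the code from this PR: the helper `redriveResources` and the `ArnParts` shape are hypothetical names introduced for the example, and it works on plain strings rather than CDK tokens.

```typescript
// Hypothetical helper (not the CDK implementation): computes the resources
// for states:RedriveExecution on map runs, following the rule in this PR.
interface ArnParts {
  partition: string;
  region: string;
  account: string;
  stateMachineName: string;
}

function redriveResources(
  parts: ArnParts,
  mapRunLabels: Array<string | undefined>,
): string[] {
  const prefix =
    `arn:${parts.partition}:states:${parts.region}:${parts.account}` +
    `:execution:${parts.stateMachineName}`;

  const labels: string[] = [];
  for (const label of mapRunLabels) {
    // Any unlabeled distributed map forces the wildcard form,
    // mirroring what the console generates.
    if (label === undefined) {
      return [`${prefix}/*:*`];
    }
    // Skip duplicate labels so each resource appears once.
    if (labels.indexOf(label) === -1) {
      labels.push(label);
    }
  }
  // Only labeled maps: one resource per distinct label.
  return labels.map((label) => `${prefix}/${label}:*`);
}
```

With only labeled maps the result is one ARN per label; a single unlabeled map collapses the result to the `/*:*` wildcard form.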

Description of changes

Minor: fix map-with-catch integ test failing because it was not emptying the bucket before attempting to delete it

  • Updated the permissions granted in the bind method for DistributedMaps to include correct redrive permissions
  • Fixed incorrect resource on the states:StopExecution and states:DescribeExecution permissions
  • Not included: the fix for missing DistributedMap policy statements when the DistributedMap state is in a child StateGraph such as a Parallel.branch. This will be handled in a separate PR.

These changes fix the root cause of grantRedriveExecution not working for state machines with DistributedMaps

Describe any new or updated permissions being added

when there is any unlabeled distributed map run state

  action: ["states:DescribeExecution", "states:StopExecution"]
- resource: "arn:aws:states:us-east-2:account-id:stateMachine:myStateMachine:*"
+ resource: "arn:aws:states:us-east-2:account-id:execution:myStateMachine:*"
//...
+ action: "states:RedriveExecution"
+ resource: "arn:aws:states:us-east-2:account-id:execution:myStateMachine/*:*"

when there are only labeled distributed map runs

  action: ["states:DescribeExecution", "states:StopExecution"]
- resource: "arn:aws:states:us-east-2:account-id:stateMachine:myStateMachine:*"
+ resource: "arn:aws:states:us-east-2:account-id:execution:myStateMachine:*"
//...
+ action: "states:RedriveExecution"
+ resource: [
+   "arn:aws:states:us-east-2:account-id:execution:myStateMachine/myLabel1:*", 
+   "arn:aws:states:us-east-2:account-id:execution:myStateMachine/myLabel2:*"
+ ]
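
The resource change in both diffs above swaps the `stateMachine` resource type for `execution`. The sketch below shows that transformation on a concrete ARN string. The function `executionArnPattern` is a hypothetical name for illustration only; inside CDK the state machine ARN is usually an unresolved token, so this kind of string splitting would not apply there as written.

```typescript
// Hypothetical helper: turns a resolved state machine ARN into the
// execution ARN pattern that DescribeExecution/StopExecution must target.
// Assumes the input is a plain string, not a CDK token.
function executionArnPattern(stateMachineArn: string): string {
  // Expected layout: arn:partition:states:region:account:stateMachine:name
  const parts = stateMachineArn.split(':');
  if (parts.length !== 7 || parts[5] !== 'stateMachine') {
    throw new Error(`Not a state machine ARN: ${stateMachineArn}`);
  }
  parts[5] = 'execution';
  // The trailing ":*" matches every execution of the state machine.
  return `${parts.join(':')}:*`;
}
```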

Description of how you validated changes

Unit & Integration tests

Checklist


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

@github-actions github-actions bot added bug This issue is a bug. effort/medium Medium work item – several days of effort p2 labels Jun 19, 2025
@aws-cdk-automation aws-cdk-automation requested a review from a team June 19, 2025 07:56
@github-actions github-actions bot added the beginning-contributor [Pilot] contributed between 0-2 PRs to the CDK label Jun 19, 2025
@aws-cdk-automation aws-cdk-automation left a comment

(This review is outdated)

@Kasra-G Kasra-G changed the title fix(stepfunctions): Fix incorrect/missing DistributedMap permissions to run/redrive state machines fix(stepfunctions): fix incorrect/missing DistributedMap permissions to run/redrive state machines Jun 19, 2025
@aws-cdk-automation aws-cdk-automation dismissed their stale review June 19, 2025 08:04

✅ Updated pull request passes all PRLinter validations. Dismissing previous PRLinter review.

@Kasra-G Kasra-G changed the title fix(stepfunctions): fix incorrect/missing DistributedMap permissions to run/redrive state machines fix(stepfunctions): incorrect/missing DistributedMap permissions to run/redrive state machines Jun 19, 2025
@Kasra-G Kasra-G force-pushed the ft/step-functions/distributed-map-redrive-permissions branch from 3720ec0 to d65f21d Compare June 19, 2025 08:12
@Kasra-G Kasra-G changed the title fix(stepfunctions): incorrect/missing DistributedMap permissions to run/redrive state machines fix(stepfunctions): incorrect/missing permissions to run/redrive state machines with DistributedMap Jun 19, 2025
@Kasra-G Kasra-G changed the title fix(stepfunctions): incorrect/missing permissions to run/redrive state machines with DistributedMap fix(stepfunctions): incorrect/missing permissions to run/redrive DistributedMap in state machine Jun 19, 2025
@Kasra-G Kasra-G marked this pull request as ready for review June 19, 2025 08:29
@aws-cdk-automation aws-cdk-automation added the pr/needs-community-review This PR needs a review from a Trusted Community Member or Core Team Member. label Jun 19, 2025
@Kasra-G Kasra-G force-pushed the ft/step-functions/distributed-map-redrive-permissions branch 2 times, most recently from 8b6afd4 to 6d8438c Compare August 1, 2025 06:01
@aws-cdk-automation

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildv2Project1C6BFA3F-wQm2hXv2jqQv
  • Commit ID: 6d8438c
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@kumvprat kumvprat self-assigned this Aug 22, 2025

@kumvprat kumvprat left a comment

@Kasra-G Thanks for the contribution, this seems to be a niche one
I will add inline comments for this

Since it has been a while since the issue was reported, I am not sure if having the RedriveExecution policy on the DistributedMap state type is a default expectation (maybe it's called out in the documentation and I missed it; if so, it would be nice to link it in the PR description).

There are integration tests which cover the basic distributed map functionality (packages/@aws-cdk-testing/framework-integ/test/aws-stepfunctions/test/integ.distributed-map.ts), which were working before(?) without explicitly adding the RedriveExecution-based policy. (I am trying to make a case for not needing the RedriveExecution policy by default.)

const distributedMapPolicy = new iam.Policy(stateMachine, 'DistributedMapPolicy');
stateMachine.grantStartExecution(distributedMapPolicy);
stateMachine.grantExecution(distributedMapPolicy, 'states:DescribeExecution', 'states:StopExecution');
stateMachine.grantRedriveExecution(distributedMapPolicy);
Contributor

Based on the linked issue, it seems like the StartExecution, StopExecution and DescribeExecution permissions are not enough for the DistributedMap state to work, and it also needs the RedriveExecution permission to function properly. Is this summary correct? (comment link)

Contributor

It could be that the faulty policy before was the reason as it had stateMachine arn instead of execution arn :

new iam.PolicyStatement({
  actions: ['states:DescribeExecution', 'states:StopExecution'],
  resources: [`${stateMachine.stateMachineArn}:*`],
}),

Contributor Author

StartExecution, StopExecution and DescribeExecution permissions are not enough for the DistributedMap state to work and it needs the RedriveExecution permission also to function properly

Correct, it also needs RedriveExecution, otherwise, when the state machine is redriven, it will fail to redrive the map run.

it could be that the faulty policy before was the reason as it had stateMachine arn instead of execution arn

This might have contributed to other users facing errors, but it's not quite clear to me what the effect of the incorrect permissions is. The docs state you need the permissions on the execution and not the state machine, so that change was made. The permissions generated in the console are not helpful in this regard, as the resource granted for the DescribeExecution permissions is just *.

@Kasra-G Kasra-G force-pushed the ft/step-functions/distributed-map-redrive-permissions branch from 6d8438c to 9efce58 Compare August 30, 2025 10:18
Kasra-G commented Aug 30, 2025

Thanks for taking a look at the PR & the review. I have updated it and have a few responses:

Since it has been a while since the issue was reported, I am not sure if having the RedriveExecution policy on the DistributedMap state type is a default expectation (maybe it's called out in the documentation and I missed it, if so would be nice to link it in the PR description)

I have updated the PR description with link to AWS docs talking about the permissions needed when adding a DistributedMap to a state machine.

There are integration tests which cover the basic distributed map functionality : packages/@aws-cdk-testing/framework-integ/test/aws-stepfunctions/test/integ.distributed-map.ts which were working before(?) without explicitly adding the RedriveExecution based policy.

Yes, those integ tests work, but they do not test the redrive capability. That is why I have added a new integration test that tests the redrive capability for a state machine with a distributed map.

(trying to make a case for not needing the RedriveExecution policy by default)

The redrive permissions are added when creating a distributed map in a state machine via the AWS Console (and selecting the option to create a new role), but not in CDK.

Just an additional note, I updated the new integ test integ.distributed-map-redrive to be a bit more comprehensive and test both distributedMap with a label and without a label, since the permissions are different.

Please let me know if there are any questions.

Kasra-G commented Aug 30, 2025

I also have changes that would fix the same underlying issue as #29913 before it was closed for inactivity. I was going to open another PR for that one after this one, but since the changes conflict, I can merge the changes into the same PR to reduce overhead - just let me know.

I have decided to add on these changes to this PR, let me know if it should be a separate PR.

});
}),

test('Instantiate State Machine With Self Referencing Distributed Map State', () => {

@kumvprat kumvprat Sep 1, 2025

Amazing that we are adding new tests : Are these possibilities allowed, like a self-referencing distributedmap state ?

It seems counter-intuitive that this use of the sfns will be helpful but if that is the case a few lines of comment above the test around when it's possible will be helpful

Contributor Author

I won't be the judge of whether it is useful to customers, but the console does allow self-referencing distributed maps. It's certainly possible when combined with a Choice state, for example, to keep looping a state until some condition is met and then break out of the loop. The test ensures CDK supports this for DistributedMap (as it should for any other state).

I can add a comment that explains why we are testing it (to make sure there's no infinite loop in the BFS), but I won't explain why customers would do it; that seems out of scope.
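
To make the infinite-loop concern concrete, here is a minimal sketch of a breadth-first search over a state graph that tolerates a self-referencing state. It is not the traversal from this PR: the `StateNode` shape and the `findDistributedMaps` function are hypothetical, and the real CDK state graph types differ.

```typescript
// Hypothetical, simplified state graph node (not the CDK State class).
interface StateNode {
  id: string;
  next: string[]; // ids of states reachable from this one
  isDistributedMap?: boolean;
}

// Breadth-first search that collects DistributedMap state ids.
// The visited map is what prevents an infinite loop when a state
// lists itself (or an ancestor) as a next state.
function findDistributedMaps(
  states: Record<string, StateNode>,
  startId: string,
): string[] {
  // Prototype-free object, so ids like "constructor" are not pre-visited.
  const visited: Record<string, boolean> = Object.create(null);
  const queue: string[] = [startId];
  const found: string[] = [];
  visited[startId] = true;

  while (queue.length > 0) {
    const id = queue.shift() as string;
    const state = states[id];
    if (state === undefined) {
      continue; // dangling reference; ignored in this sketch
    }
    if (state.isDistributedMap) {
      found.push(id);
    }
    for (const nextId of state.next) {
      if (!visited[nextId]) {
        visited[nextId] = true;
        queue.push(nextId);
      }
    }
  }
  return found;
}
```

Without the visited check, a map state whose next state is itself would be enqueued forever; with it, each state is processed once.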

Contributor

That makes sense, we do want to point out why we are testing it. Customers would be free to use the functionality if it's available.

A comment on the test would help us understand in the future why this is being tested.

@Kasra-G Kasra-G force-pushed the ft/step-functions/distributed-map-redrive-permissions branch from 3e3e150 to a7544b9 Compare September 2, 2025 04:02
Kasra-G commented Sep 2, 2025

@kumvprat I have removed the fix for DistributedMap policies not showing up in nested StateGraphs and will raise a new PR for that, as requested, once this one gets merged. Let me know if we actually want to put those changes in this PR.

I will change the linked issue to a new one that is not the main discussion thread, as I do not want that thread accidentally getting resolved when there are still pending changes.


@kumvprat kumvprat left a comment

@Kasra-G Are the changes to just the index.js for the aws-lambda-python-alpha and aws-ses packages' integ tests needed? The index.js, which is an asset, is changing, but there are no corresponding changes related to the asset elsewhere (like the asset hash, which changes when the asset is used in the template).

Kasra-G commented Sep 3, 2025

Are the changes to just the index.js for aws-lambda-python-alpha and aws-ses packages integ tests needed

Probably not. I think those integ tests failed on my first version of the PR, so I ran them to update the files, but there weren't really any changes. I'll try resetting those changes. I'm actually still getting CHANGED integ tests for aws-lambda-python-alpha locally, but I will just ignore those; it is probably something wrong with my local environment.

Kasra-G commented Sep 3, 2025

Apologies, I just noticed I have committed snapshot files for an integ test that I renamed all the way back in my first or second PR revision. I will remove those snapshot files.

@Kasra-G Kasra-G requested a review from kumvprat September 3, 2025 19:06

@kumvprat kumvprat left a comment

Thank you for the detailed work on the PR.

mergify bot commented Sep 4, 2025

Thank you for contributing! Your pull request will be updated from main and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork).

@mergify mergify bot merged commit bbebb79 into aws:main Sep 4, 2025
19 checks passed
github-actions bot commented Sep 4, 2025

Comments on closed issues and PRs are hard for our team to see.
If you need help, please open a new issue that references this one.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 4, 2025
Labels

beginning-contributor [Pilot] contributed between 0-2 PRs to the CDK bug This issue is a bug. effort/medium Medium work item – several days of effort needs-security-review Related to feature or issues that needs security review p2 pr/needs-community-review This PR needs a review from a Trusted Community Member or Core Team Member.

Development

Successfully merging this pull request may close these issues.

stepfunction: DistributedMap incorrect Describe/StopExecution and missing RedriveExecution permissions on state machine role

3 participants